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We consider large-scale studies in which thousands of significance 
tests are performed simultaneously. In some of these studies, the mul¬ 
tiple testing procedure can be severely biased by latent confounding 
factors such as batch effects and unmeasured covariates that corre¬ 
late with both primary variable(s) of interest (e.g. treatment variable, 
phenotype) and the outcome. Over the past decade, many statisti¬ 
cal methods have been proposed to adjust for the confounders in 
hypothesis testing. We unify these methods in the same framework, 
generalize them to include multiple primary variables and multiple 
nuisance variables, and analyze their statistical properties. In partic¬ 
ular, we provide theoretical guarantees for RUV-4 [26] and LEAPP 
[60], which correspond to two different identification conditions in 
the framework: the first requires a set of “negative controls” that are 
known a priori to follow the null distribution; the second requires the 
true non-nulls to be sparse. Two different estimators which are based 
on RUV-4 and LEAPP are then applied to these two scenarios. We 
show that if the confounding factors are strong, the resulting estima¬ 
tors can be asymptotically as powerful as the oracle estimator which 
observes the latent confounding factors. For hypothesis testing, we 
show the asymptotic 2 -tests based on the estimators can control the 
type I error. Numerical experiments show that the false discovery 
rate is also controlled by the Benjamini-Hochberg procedure when 
the sample size is reasonably large. 


1. Introduction. Multiple hypothesis testing has become an impor¬ 
tant statistical problem for many scientific fields, where tens of thousands 
of tests are typically performed simultaneously. Traditionally the tests are 
assumed to be independent of each other, so the false discovery rate (FDR) 
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can be easily controlled by e.g., the Benjamini-Hochberg procedure [9]. Re¬ 
cent years have witnessed an extensive investigation of multiple hypothesis 
testing under dependence, ranging from permutation tests [34, 61], posi¬ 
tive dependence [10], weak dependence [15, 57], accuracy calculation under 
dependence [19, 45] to mixture models [20, 59] and latent factor models 
[21, 22, 36]. Many of these works provide theoretical guarantees for FDR 
control under the assumption that the individual test statistics are valid 
and may even be correlated. 

In this paper, we investigate a more challenging setting. The test statis¬ 
tics may be correlated with each other due to latent factors and those latent 
factors may also be correlated with the variable of interest. As a result, 
the test statistics are not only correlated but are also confounded. We use 
the phrase “confounding” to emphasize that these latent factors can signifi¬ 
cantly bias the individual p-values, therefore this problem is fundamentally 
different from the literature in the previous paragraph and poses an im¬ 
mediate threat to the reproducibility of the discoveries. Many confounder 
adjustment methods have already been proposed for multiple testing over 
the last decade [26, 39, 50, 60]. Our goal is to unify these methods in the 
same framework and study their statistical properties. 

The confounding problem. We start with three real data examples to 
illustrate the confounding problem. The first microarray data (Figure la) 
is used by Singh et al. [56] to identify candidate genes associated with a 
chronic lung disease called emphysema. The second (Figures lb and Id) 
and third (Figure Ic) data are used by Gagnon-Bartsch, Jacob and Speed 
[26] to study the performance of various confounder adjustment methods. 
For each dataset, we plot the histogram of t-statistics of a simple linear 
model that regresses the gene expression on the variable of interest (disease 
status for the first and gender for the second and third datasets). These 
statistics are commonly used in genome-wide association studies (GWAS) 
to hnd potentially interesting genes. See Section 6.2.1 for more detail of 
these datasets. 

The histograms of t-statistics in Figure 1 clearly depart from the approxi¬ 
mate theoretical null distribution N(0,1). The bulk of the test statistics can 
be skewed (Figures la and lb), overdispersed (Figure la), underdispersed 
(Figures lb and Id), or noncentered (Figure Ic). In these cases, neither 
the theoretical null N(0,1), nor even the empirical null as shown in the 
histograms, look appropriate for measuring significance. Schwartzman [53] 
proved that a largely overdispersed histogram like Figure la cannot be ex¬ 
plained by correlation alone, and is possibly due to the presence of confound¬ 
ing factors. For a sneak preview of the confounder adjustment, the reader 
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can find the histograms after our confounder adjustment in Figure 3 at the 
end of this paper. The p-values of our test of confounding (Section 3.3.2) in 
Table 2 indicate that all the three datasets suffer from confounding latent 
factors. 

Other common sources of confounding in gene expression profiling in¬ 
clude systematic ancestry differences [50], environmental changes [23, 28] 
and surgical manipulation [42]. See Lazar et al. [37] for a survey. In many 
studies, especially for observational clinical research and human expression 
data, the latent factors, either genetic or technical, are confounded with pri¬ 
mary variables of interest due to the observational nature of the studies and 
heterogeneity of samples [51, 52]. Similar confounding problems also occur 
in other high-dimensional datasets such as brain imaging [54] and metabo- 
nomics [16]. 

Previous methods. As early as Alter, Brown and Botstein [1], principal 
component analysis has been suggested to estimate the confounding factors. 
This approach can work reasonably well if the confounders clearly stand out. 
For example, in population genetics, Price et al. [50] proposed a procedure 
called EIGENSTRAT that removes the largest few principal components 
from their SNP genotype data, claiming they closely resemble the ancestry 
difference. In gene expression data, however, it is often unrealistic to assume 
they always represent the confounding factors. The largest principal com¬ 
ponent may also correlate with the primary effects of interest. Therefore, 
directly removing them can result in loss of statistical power. 

More recently, an emerging literature considers the confounding problem 
in similar statistical settings and a variety of methods have been proposed for 
confounder adjustment [25, 26, 27, 38, 39, 60]. These statistical methods are 
shown to work better than the EIGENSTRAT procedure for gene expression 
data. However, little is known about their theoretical properties. Indeed, 
the authors did not focus on model identifiability and rely on impressive 
heuristic calculations to derive their estimators. In this paper, we address 
the identifiability problem, rederive the estimators in [26, 60] in a more 
principled way and provide theoretical guarantees for them. 

Before describing the modeling framework, we want to clarify our termi¬ 
nology. The confounding factors or confounders considered in the present 
paper are referred to by different names in the literature, such as “surrogate 
variables” [38], “latent factors” [25], “batch effects” [40], “unwanted varia¬ 
tion” [27] and “latent effects” [60]. We believe they are all describing the 
same phenomenon: that there exist some unobserved variables that correlate 
with both the primary variable(s) of interest and the outcome variables (e.g. 
gene expression). This problem is generally known as confounding [24, 33]. 
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(a) Dataset 1. 
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(b) Dataset 2. 



(c) Dataset 3. 


(d) Dataset 2 (batch correction). 


Fig 1: Dataset 1 is the COPD dataset [56]. Dataset 2 and 3 are from Gagnon- 
Bartsch, Jacob and Speed [26]. Histograms of regression t-statistics in three mi¬ 
croarray studies show clear departure from the theoretical null distribution N(0,1). 
The mean and standard deviation of the normal approximation are obtained from 
the median and median absolute deviation of the statistics. See Section 6.2 for the 
empirical distributions after confounder adjustment. 
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A famous example is Simpson’s paradox. The term “confounding” has mul¬ 
tiple meanings in the literature. We use the meaning from [29]: “a mixing of 
effects of extraneous factors (called confounders) with the effect of interest”. 

Statistical model of confounding. Most of the confounder adjustment 
methods mentioned above are built around the following model 

(1.1) Y = XI3'^ + ZT'^ + E 

Here Y is a n x p observed matrix (e.g. gene expression); X is an n x 1 
observed primary variable of interest (e.g. treatment-control, phenotype, 
health trait); Z’ is an n x r latent confounding factor matrix; E is often 
assumed to be a Gaussian noise matrix. The p x 1 vector /3 contains the 
primary effects we want to estimate. 

Model (1.1) is very general for multiple testing dependence. Leek and 
Storey [39, Proposition 1] suggest that multiple hypothesis tests based on 
linear regression can always be represented by (1.1) using sufficiently many 
factors. However, equation (1.1) itself is not enough to model confounded 
tests. To elucidate the concept of confounding, we need to characterize the 
relationship between the latent variables Z and the primary variable X. To 
be more specific, we assume the regression of Z on X also follows a linear 
relationship 

(1.2) Z = Xcx^ + W, 

where W is a n x r random noise matrix independent of X and E and 
the r X 1 vector o: characterizes the extent of confounding in this data. By 
plugging (1.2) in (1.1), the linear regression of on X gives an unbiased 
estimate of the marginal effects 

(1.3) T = (3 + To: 

When a 7^ 0, T is not the same as f3 by (1.3). In this case, the data {X, Y) 
are confounded by Z. Since the confounding factors Z are data artifacts in 
this model, the statistical inference of f3 is much more interesting than that 
of T. See Section 5.2 for more discussion on the marginal and the direct 
effects. 

Following LEAPP [60], we use a QR decomposition to decouple the esti¬ 
mation of r from f3. The inference procedure splits into the following two 
steps: 

Step 1 By regressing out X in (1.1), E is the loading matrix in a factor 
analysis model and can be efficiently estimated by maximum likeli¬ 
hood. 


6 


J. WANG ET AL. 


Step 2 Equation (1.3) can be viewed as a linear regression of the marginal 
effects r on the factor loadings F. To estimate a and (3, we replace r 
by its observed value and F by its estimate in Step 1. 

As mentioned before, other existing confounder adjustment methods in¬ 
cluding SVA [39] and RUV-4 [26] can be unified in this two-step statistical 
procedure. See Section 5.3 for a detailed discussion of these methods. 

Contributions. Our first contribution in Section 2 is to establish iden- 
tihability for the confounded multiple testing model. In the first step of 
estimating factor loadings F, identifiability is well studied in classical multi¬ 
variate statistics. However, the second step of estimating the effects /3 is not 
identihable without additional constraints. We consider two different suf¬ 
ficient conditions for global identifiability. The hrst condition assumes the 
researcher has a “negative control” variable set for which there should be 
no direct effect. This negative control set often serves as a quality control 
precaution in microarray studies [27], but they can also be used to adjust 
for the confounding factors. The second identification condition assumes at 
least half of the true effects are zero, i.e., the true alternative hypotheses are 
sparse. These two identification conditions correspond to the approaches of 
RUV-4 [26] and LEAPP [60], respectively. 

Our second contribution in Section 3 is to derive valid and efficient sta¬ 
tistical methods under these identification conditions in the second step. In 
order to estimate the effects, it is essential to estimate the coefficients a 
relating the primary variable to the confounders. Under the two different 
identihcation conditions, we study two different regression methods which 
are analytically tractable and equally well performing alternatives to RUV- 
4 and LEAPP. For the negative control (NC) scenario, and are 
obtained by generalized least squares using the negative controls. For the 
sparsity scenario, 6;^^ and are obtained by using a simpler and more 
analytically tractable robust regression (RR) than the one used in LEAPP. 

When the factors are strong (as large as the noise magnitude), for both 
scenarios we hnd that the resulting estimators of [3 are asymptotically as 
efficient as the oracle estimator which is allowed to observe the confounding 
factors. It is surprising that no essential loss of efficiency is incurred by 
searching for the confounding variables. Our asymptotic analysis relies on 
some recent theoretical results for factor analysis due to Bai and Li [3]. 
The asymptotic regime we consider has both n, the number of observations, 
and p, the number of outcome variables (e.g. genes), going to inhnity. The 
most important condition that we require for asymptotic efficiency in the 
negative control scenario is that the number of negative controls increases 
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to infinity; in the sparsity scenario, we need the Li norm of the effects to 
satisfy ||/3||n/n/p —>• 0. The fact that p ^ n in many multiple hypothesis 
testing problems plays an important role in these asymptotics. 

Next in Section 3, we show that the asymptotic z-statistics based on the 
efficient estimators of (3 can control the type I error. This is not a trivial 
corollary from the asymptotic distribution of the test statistics because the 
size of (3 is growing and the z-statistics are weakly correlated. Proving FDR 
control is more technically demanding and is beyond the scope of this paper. 
Instead, we use numerical simulations to study the empirical performance 
(including FDR) of our tests. We also give a significance test of confounding 
(null hypothesis ct = 0) in Section 3. This test can help the experimenter to 
determine if there is any hidden confounder in the design or the experiment 
process. 

In Section 4, we generalize the confounder adjustment model to include 
multiple primary variables and multiple nuisance covariates. We show the 
statistical methods and theory for the single primary variable regression 
problem (I.l) can be smoothly extended to the multiple regression problem. 

Outline. Section 2 introduces the model and describes the two identifica¬ 
tion conditions. Section 3 studies the statistical inference. Section 4 extends 
our framework to a linear model with multiple primary variables and multi¬ 
ple known controlling covariates. Section 5 discusses our theoretical analysis 
in the context of previous literature, including the existing procedures for 
debiasing the confounders and existing theoretical results of multiple hy¬ 
pothesis testing under dependence (but no confounding). Section 6 studies 
the empirical behavior of our estimators in simulations and real data exam¬ 
ples. Technical proofs of the results are provided in Supplement [64]. 

To help the reader follow this paper and compare our methods and theory 
with existing approaches. Table 1 summarizes some related publications with 
more detailed discussion in Section 5. 

Notation. Throughout the article, we use bold upper-case letters for 
matrices and lower-case letters for vectors. We use Latin letters for random 
variables and Greek letters for model parameters. Subscripts of matrices are 
used to indicate row(s) whenever possible. For example, if C is a set of indices, 
then is the corresponding rows of F. The Lq norm of a vector is defined 
as the number of nonzero entries: ||/3||o = |{1 < i < P : 13j 7^ 0}|. A random 
matrix E G is said to follow a matrix normal distribution with mean 

M £ j^nxp, covariance U £ and column covariance V £ 

abbreviated as E ^ MN{M,U,V), if the vectorization of E by column 
follows the multivariate normal distribution vec{E) ~ N(vec(iW), V ® U). 
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Noise conditional on latent factors 

Independent Correlated 

Positive or weak 
dependence 

Benjamini and Yekutieli [10] 

Storey, Taylor and Siegmund [57] 

Clarke and Hall [15] 

Unconfounding 

factors 

Friguet, Kloareg and Causeur [25] 
Desai and Storey [18] 

Fan, Han and Gu [21] 

Lan and Du [36] 

Discussed in Sections 5.1 and 5.2 

Confounding 

factors 

Leek and Storey [38, 39] 
Gagnon-Bartsch and Speed [27] 
Sun, Zhang and Owen [60] 
Studied in Sections 2 to j 
Discussed in Section 5.3 

Discussed in Section 5.4 
(future research) 


Table 1 

Selected literature in multiple hypothesis testing under dependence. 

The categorization is partially subjective as some authors do not use exactly the 

same terminology. 


When U = this means the rows of E are i.i.d. N(0,V). We use the 
usual notation in asymptotic statistics that a random variable is Op{l) if it 
is bounded in probability, and Op(l) if it converges to 0 in probability. Bold 
symbols Op{l) or Op(l) mean each entry of the vector is Op(l) or Op(l). 

2. The Model. 


2.1. Linear model with confounders. We consider a single primary vari¬ 
able of interest in this section. It is common to add intercepts and known 
confounder effects (such as lab and batch effects) in the regression model. 
This extension to multiple linear regression does not change the main theo¬ 
retical results in this paper and is discussed in Section 4. 

For simplicity, all the variables in this section are assumed to have mean 
0 marginally. Our model is built on equation (1.1) that is already widely 
used in the existing literature and we rewrite it here: 


(2.1a) 


As mentioned earlier, it is also crucial to model the dependence of the con- 
founders Z and the primary variable X. We assume a linear relationship as 
in (1.2) 

(2.1b) Z = Xa^ + W, 


and in addition some distributional assumptions on X, W and the noise 
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matrix E 

(2.1c) Xi mean 0, variance 1, i = 1,..., n, 

(2.1d) W In, Ir), W±X, 

(2.1e) E In,'E), E ±{X,Z). 

The parameters in the model (2.1) are /3 G the primary effects 

we are most interested in, T G the influence of confounding factors 

on the outcomes, ct G the association of the primary variable with 

the confounding factors, and X) G the noise covariance matrix. We 

assume 5] is diagonal S = diag{af, ..., Cp), so the noise for different outcome 
variables is independent. We discuss possible ways to relax this independence 
assumption in Section 5.4. 

In (2.1c), Xi is not required to be Gaussian or even continuous. For ex¬ 
ample, a binary or categorical variable after normalization also meets this 
assumption. As mentioned in Section 1, the parameter vector a measures 
how severely the data are confounded. For a more intuitive interpretation, 
consider an oracle procedure of estimating [3 when the confounders Z in 
(2.1a) are observed. The best linear unbiased estimator in this case is the 
ordinary least squares (/3®^®, whose variance is cr|Var(Aj, Zj)“^/n. 

Using (2.1b) and (2.Id), it is easy to show that Var(/3®^®) = {l + \\oL\\\)(T‘j/n 
and Cov(/3°'"®,/3®^®) = 0 for j / k. In summary, 

(2.2) Var(/3°^S) = -(1 + ||a||i)5]. 

n 

Notice that in the unconfounded linear model in which Z = 0, the variance 
of the OLS estimator of /3 is S/n. Therefore, I-|- ||q :||2 represents the relative 
loss of efficiency when we add observed variables Z to the regression which 
are correlated with X. In Section 3.2, we show that the oracle efficiency 
(2.2) can be asymptotically achieved even when Z is unobserved. 

Let 6 = (a, f3, F, X)) be all the parameters and 0 be the parameter space. 
Without any constraint, the model (2.1) is unidentifiable. In Sections 2.3 
and 2.4 we show how to restrict 0 to ensure identifiability. 

2.2. Rotation. Following Sun, Zhang and Owen [60], we introduce a 
transformation of the data to make the identification issues clearer. Con¬ 
sider the Householder rotation matrix qT ^ ^nxn qTx = 

ll-^lbei = (II W|| 2 , 0,0,..., 0)^. Left-multiplying Y by we get Y = 
Q^Y = 11 X 11261 / 3 ^ + ZT^ + E, where 

Z = Q^Z = Q^iXcJ' + W) = ||X||2eiQ^ + W, 


(2.3) 
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and W = Q^W = W, E = Q^E = E. As a consequence, the first and the 
rest of the rows of Y are 

(2.4) Yi = ||x||2/3'^ + Zir^ + .Ei ~N(||x||2(/3 + rQ)^,rr'^ + s), 

(2.5) 111 = + E_i ~ MN(0, In-uTT^ + S). 

Here Yi is a 1 x p vector, YLi is a (n — 1) x p matrix, and the distributions 
are conditional on X. 

The parameters a and f3 only appear in (2.4), so their inference (step 1 
in our procedure) can be completely separated from the inference of T and 
X) (step 2 in our procedure). In fact, Yi X Y_i|X because Ei X £l_i, so 
the two steps use mutually independent information. This in turn greatly 
simplifies the theoretical analysis. 

We intentionally use the symbol Q to resemble the QR decomposition of 
X. In Section 4 we show how to use the QR decomposition to separate the 
primary effects from confounder and nuisance effects when X has multiple 
columns. Using the same notation, we discuss how SVA and RUV decouple 
the problem in a slightly different manner in Section 5.3.1. 

2.3. Identifiability of F. Equation (2.5) is just the exploratory factor 
analysis model, thus F can be easily identified up to some rotation under 
some mild conditions. Here we assume a classical sufficient condition for the 
identification of F [2, Theorem 5.1]. 

Lemma 2.1. Let 0 = 0o be the parameter space such that 

1. If any row o/F is deleted, there remain two disjoint submatrices ofT 
of rank r, and 

2. F^S-i F Ip is diagonal and the diagonal elements are distinct, positive, 
and arranged in decreasing order. 

Then F and S are identifiable in the model (2.1). 

In Lemma 2.1, condition (1) requires that p > 2r + 1. Condition (1) 
identifies F up to a rotation which is sufficient to identify (3. To see this, 
we can reparameterize F and cx to VU and U'^cx using an r x r orthogonal 
matrix U. This reparameterization does not change the distribution of Yi 
in (2.4) if (3 remains the same. Condition (2) identifies the rotation uniquely 
but is not necessary for our theoretical analysis in later sections. 

2.4. Identifiability of p. The parameters /3 and a cannot be identified 
from (2.4) because they have in total p + r parameters while Yi is a length p 
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vector. If we write Vr and as the projection onto the column space and 
orthogonal space of F so that /3 = Vrfi + F’r±/3, it is impossible to identify 
VtI3 from (2.4). 

This suggests that we should further restrict the parameter space 0. We 
will reduce the degrees of freedom by restricting at least r entries of [3 to 
equal 0. We consider two different sufficient conditions to identify /3: 

Negative control 0i = {(o;,/3, F, X)) : /3c = 0, rank(Fc) = r} for a 
known negative control set \C\ > r. 

Sparsity 02(s) = {(a,/3, F, S) : ||/3||o < [(P - s)/2j, rank(Fc) = r, VC C 
{1,.. . ,p}, \C\ = s} for some r < s < p. 

Proposition 2.1. // 0 = 0o n 0i or 0 = 0o n 02(s) for some r < 

s < p, the parameters 6 = { a ,( 3 , T , 51) in the model (2.1) are identifiable. 

Proof. Since 0 C 0o, we know from Lemma 2.1 that F and S are iden¬ 
tifiable. Now consider two combinations of parameters = (a W,/3W,F,5]) 
and = ( q;( 2)^ ^(2)^ F, S) both in the space 0 and inducing the same dis¬ 
tribution in the model (2.1), i.e. = /3^^^ -|- 

Let C be the set of indices such that (3^'^ = (3^^^ = 0. If 0 = 0o H 0i, we 
already know |C| > r. If 0 = 0o n 02(s), it is easy to show that |C| > s is 
also true because both and (3^'^'l have at most Y{p—s)/2\ nonzero entries. 
Along with the rank constraint on Fc, this implies that Fco:^^^ = Fcct^^^. 
However, the conditions in 0i and 02 ensure that Fc has full rank, so 
and hence (3^^^ = (3^‘^\ □ 

Remark 2.1. The condition (2) in Lemma 2.1 that uniquely identifies 
F is not necessary for the identification of /3. This is because for any set 
|C| > r and any orthogonal matrix U G we always have rank(Fc) = 

rank(Fc)t/. Therefore F only needs to be identified up to a rotation. 

Remark 2.2. Almost all dense matrices of F G satisfy the condi¬ 
tions. However, for 02 ( 5 ) the sparsity of F allowed depends on the sparsity 
of /3. The condition 02(s) rules out some too sparse F. In this case, one 
may consider using confirmatory factor analysis instead of exploratory fac¬ 
tor analysis to model the relationship between confounders and outcomes. 
For some recent identification results in confirmatory factor analysis, see 
Grzebyk, Wild and Chouaniere [30], Kuroki and Pearl [35]. 

Remark 2.3. The maximum allowed ||/3||o in 02, [(p —r)/2j, is exactly 
the maximum breakdown point of a robust regression with p observations 
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and r predictors [43] . Indeed we use a standard robust regression method to 
estimate (3 in this case in Section 3.2.2. 

Remark 2.4. To the best of our knowledge, the only existing literature 
that explicitly addresses the identifiability issue is Sun [58, Chapter 4.2], 
where the author gives sufficient conditions for local identifiability of (3 by 
viewing (2.1a) as a “sparse plus low rank” matrix decomposition problem. 
See Chandrasekaran, Parrilo and Willsky [14, Section 3.3] for a more gen¬ 
eral discussion of the local and global identifiability for this problem. Local 
identifiability refers to identifiability of the parameters in a neighborhood of 
the true values. In contrast, the conditions in Proposition 2.1 ensure that (3 
is globally identifiable in the restricted parameter space. 

3. Statistical Inference. As mentioned earlier in Section 1, the sta¬ 
tistical inference consists of two steps: the factor analysis (Section 3.1) and 
the linear regression (Section 3.2). 

3.1. Inference for P and S. The most popular approaches for factor 
analysis are principal component analysis (PCA) and maximum likelihood 
(ML). Bai and Ng [7] derived a class of estimators of r by principal com¬ 
ponent analysis using various information criteria. The estimators are con¬ 
sistent under Assumption 3 in this section and some additional technical 
assumptions in Bai and Ng [7]. Due to this reason, we assume the number 
of confounding factors r is known in this section. See Owen and Wang [46, 
Section 3] for a comprehensive literature review of choosing r in practice. 

We are most interested in the asymptotic behavior of factor analysis when 
both ra,p —)■ oo. In this case, PCA cannot consistently estimate the noise 
variance S [3]. For theoretical analysis, we use the quasi maximum likeli¬ 
hood estimate in Bai and Li [3] to get P and 51. This estimator is called 
“quasi”-MLE because it treats the factors Z_i as fixed quantities. Since 
the confounders Z in our model (2.1) are random variables, we introduce a 
rotation matrix R G and let Z^l = Z_i(i?“^)^, P® = Pi? be the 

target factors and factor loadings that are studied in Bai and Li [3] . 

To make Z^l and P^®^ identifiable, Bai and Li [3] consider five different 
identification conditions. However, the parameter of interest in model (2.1) 
is (3 instead of P or As we have discussed in Section 2.4, we only 
need the column space of P to estimate /3, which gives us some flexibility of 
choosing the identification condition. In our theoretical analysis we use the 
third condition (IC3) in Bai and Li [3], which imposes the constraints that 
(n — l)~^(Z^l)"^Z^l = Ir and is diagonal. Therefore, the 

rotation matrix R satisfies RR^ = (n — 1)~^Zf^^Z^i. 
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The quasi-log-likelihood being maximized in Bai and Li [3] is 



where S is the sample covariance matrix of Y-i. 

The theoretical results in this section rely heavily on recent findings in 
Bai and Li [3]. They use these three assumptions. 

Assumption 1. The noise matrix E follows the matrix normal distri¬ 
bution E ~ MN(0, In, S) and 5] is a diagonal matrix. 

Assumption 2. There exists a positive constant D such that ||rj ||2 < 
D, D~^ < for all j, and the estimated variances ct| G , D'^] for 

all j. 

Assumption 3. The limits limp_^oo and limp_^oo 

rj)(rj ® rj) exist and are positive definite matrices. 

Lemma 3.1 (Bai and Li [3]). Under Assumptions 1 to 3, the maximizer 
(r,5]) of the quasi-log-likelihood (3.1) satisfies 


- rf) A N(0, apr), and 



In Appendix A.l, we prove some strengthened technical results of Lemma 3.1 
that are used in the proof of subsequent theorems. 

Remark 3.1. Assumption 2 is Assumption D from [3]. It requires that 
the diagonal elements of the quasi-MLE S be uniformly bounded away from 
zero and infinity. We would prefer boundedness to be a consequence of 
some assumptions on the distribution of the data, but at present we are 
unaware of any other results like Lemma 3.1 which do not use this assump¬ 
tion. In practice, the quasi-likelihood problem (3.1) is commonly solved by 
the Expectation-Maximization (EM) algorithm. Similar to Bai and Li [3, 5], 
we do not find it necessary to impose an upper or lower bound for the 
parameters in the EM algorithm in the numerical experiments. 

3.2. Inference for a and f3. The estimation of o: and f3 is based on the 
first row of the rotated outcome Yi in (2.4), which can be rewritten as 


Yi^/WXh = (3 + T{cx + Wi/WXh) + Ef/WXh 


(3.2) 
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where Wi ~ N(0,/p) is from (2.3) and Wi is independent of Ei ~ N(0, S). 
Note that Ih/ll^lb is proportional to the sample covariance between Y 
and X. All the methods described in this section first try to find a good 
estimator a. They then use /3 = /||X ||2 — To; to estimate /3. 

To reduce variance, we choose to estimate (3.2) conditional on Wi. Also, 
to use the results in Lemma 3.1, we replace T by T^^^ Then, we can rewrite 

(3.2) as 

(3.3) Yf/||X||2 = /3 + r(o)Q(o) + .Ef/||X||2 

where = TR and = R~^{a + Wi/||X|| 2 ). Notice that the random 
R only depends on YLi and thus is independent of Yi. In the proof of the 
results in this section, we first consider the estimation of (3 for fixed Wi, 
R and X, and then show the asymptotic distribution of f3 indeed does not 
depend on Wi, R oi X, and thus also holds unconditionally. 

3.2.1. Negative control scenario. If we know a set C such that /3c = 0 
(so © C ©i), then Yi can be correspondingly separated into two parts: 

Ylfi/WXh = rf aW + Elc/\\Xh, and 

Yi,-c/m2 = P-C + + EI_c/\\X\\2. 

This estimator matches the RUV-4 estimator of [26] except that it uses 
quasi-maximum likelihood estimates of S and F instead of using PCA, and 
generalized linear squares instead of ordinary linear squares regression. The 
details are in Section 5.3.2. 

The number of negative controls \C\ may grow as p —?• oo. We impose an 
additional assumption on the latent factors of the negative controls. 

Assumption 4. limp_ 5 .oo c exists and is positive definite. 

We consider the following negative control (NC) estimator where is 
estimated by generalized least squares: 

(3.5) ^ (f Js-ifc)-if and 

(3.6) = yi^_c/||X||2 - f .cd^'C. 

Our goal is to show consistency and asymptotic variance of /3^^. Let Yq 
represents the noise covariance matrix of the variables in C. We then have 
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Theorem 3.1. Under Assumptions 1 to 4 , if n,p — )• oo and p/n^ — )• 0 
for some k > 0, then for any fixed index set S with finite cardinality and 
5 n C = 0, we have 

(3.7) - Ps) ^ N(0, (1 + ||a||i)(5]5 + As)) 
where As = Ts(T^'Sc^Tc)-^r^. 

If in addition, \C\ —)• oo, the minimum eigenvalue of oo by 

Assumption f, then the maximum entry of As goes to 0. Therefore in this 
ease 

(3.8) V^iP^^ - Ps) 4 N(0, (1 + ||a||i)5]5). 

The asymptotic variance in (3.8) is the same as the variance of the oracle 
least squares in (2.2). Comparable oracle efficiency statements can be found 
in the econometrics literature [8, 63]. This is also the variance used implicitly 
in RUV-4 as it treats the estimated Z as given when deriving test statistics 
for p. When the number of negative controls is not too large, say |C| = 30, 
the correction term As is nontrivial and gives more accurate estimate of 
the variance of 0^^. See Section 6.1 for more simulation results. 


3.2.2. Sparsity seenario. When the zero indices in P are unknown but 
sparse (so 0 C © 2 ), the estimation of cx and P from 1 ^^/||X ||2 = P + 
tWqW + eI/\\X\\2 can be cast as a robust regression by viewing as 
observations and T^*^^ as design matrix. The nonzero entries in P correspond 
to outliers in this linear regression. 

The problem here has two nontrivial differences compared to classical 
robust regression. First, we expect some entries of P to be nonzero, and 
our goal is to make inference on the outliers; second, we don’t observe the 
design matrix T^*^^ but only have its estimator T. In fact, if /3 = 0 and T*-®) 
is observed, the ordinary least squares estimator of is unbiased and 
has variance of order l/{np), because the noise in (3.2) has variance 1/n 
and there are p observations. Our main conclusion is that can still be 
estimated very accurately given the two technical difficulties. 

Given a robust loss function p, we consider the following estimator: 


p 

cx = arg min > p 




D/ 


\X\\2-T]cx 


(3.9) 


and 
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For a broad class of loss functions p, estimating a by (3.9) is equivalent to 

^ 1 . 

(3.11) (a^^,^) = argmin /\\X \\2 - cxf + Px{l3), 

where P\{(3) is a penalty to promote sparsity of /3 [55]. However is 

not identical to /3, which is a sparse vector that does not have an asymp¬ 
totic normal distribution. The LEAPP algorithm [60] uses the form (3.11). 
Replacing it by the robust regression (3.9) and (3.10) allows us to derive 
significance tests of H^j : (3j = 0. 

We assume a smooth loss p for the theoretical analysis: 

Assumption 5. The penalty p : M —)• [0, oo) with p(0) = 0. The function 
p{x) is non-increasing when x < 0 and is non-decreasing when x > 0. The 
derivative ip = p' exists and \^p\ < D for some D < oo. Furthermore, p is 
strongly convex in a neighborhood of 0. 

A sufficient condition for the local strong convexity is that ip' > Q exists in 
a neighborhood of 0. The next theorem establishes the consistency of /3^^. 

Theorem 3.2. Under Assumptions 1 to 3 and 5, ifn,p —>■ oo, p/n^ —>■ 0 
for some k > 0 and ||/3||i/p —)• 0, then A ct. As a consequence, for any 

j, 4 /?,■. 

To derive the asymptotic distribution, we consider the estimating equation 
corresponding to (3.9). By taking the derivative of (3.9), 6;^^ satisfies 

(3.12) A'* 

^ i=i 

The next assumption is used to control the higher order term in a Taylor 
expansion of 'I'. 

Assumption 6. The first two derivatives of ip exist and both \^p'{x)\ < D 
and \'ip''{x)\ < D hold at all x for some D < oo. 

Examples of loss functions p that satisfy Assumptions 5 and 6 include 
smoothed Huber loss and Tukey’s bisquare. 

The next theorem gives the asymptotic distribution of when the 
nonzero entries of (3 are sparse enough. The asymptotic variance of is, 
again, the oracle variance in (2.2). 


El 


B'/ 


\Xh-Tjcx 


rA,RR' 


(T3 


Tj/uj = 0 . 
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Theorem 3.3. Under Assumptions 1 to 3, 5 and 6, ifn,p —>• oo, with 
p/n’^ ^ 0 for some k > 0 and ||/3||i\/n/p —)■ 0, then 

- Ps) ^ N(0, (1 + ||«||i)S5) 
for any fixed index set S with finite cardinality. 

If n/p —>■ 0, then a sufficient condition for ||/3||i\/n/p —)• 0 in Theorem 3.3 
is ||/3||i = 0{^). If instead njp^ c G (0, oo), then ||/3||i = o{,yp) suffices. 

3.3. Hypothesis Testing. In this section, we construct significance tests 
for P and o: based on the asymptotic normal distributions in the previous 
section. 


3.3.1. Test of the primary effects. We consider the asymptotic test for 
Hqj : Pj = 0, j = 1,... ,p resulting from the asymptotic distributions of Pj 
derived in Theorems 3.1 and 3.3. 


(3.13) 




\XhPj 


(T. 


'\/l + 


Q 


12 ’ 


j = 1, • • ■ 


Here we require \C\ —)> oo for the NC estimator. The null hypothesis Hoj 
is rejected at level-a if \tj\ > Zai 2 = — a/2) as usual, where is 

the cumulative distribution function of the standard normal. Note that here 
we slightly abuse the notation a to represent the significance level and this 
should not be confused with the model parameter a. 

The next theorem shows that the overall type-I error and the family- 
wise error rate (FWER) can be asymptotically controlled by using the test 
statistics tj,j = 1,... ,p. 


Theorem 3.4. Let Np = {j\Pj = 0,j = l,...,p} be all the true null 
hypotheses. Under the assumptions of Theorem 3.1 or Theorem 3.3, \C\ —>■ oo 
for the NC scenario, as n,p, {Afpl —)• oo 

I{\tj\ > Zai 2 ) A a, and 


(3.15) limsup P (E > ^Q,/(2p)) ^ 1^ E Ct- 

j&Np 
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Although the individual test is asymptotically valid as tj -4- N(0,1), The¬ 
orem 3.4 is not a trivial corollary of the asymptotic normal distribution in 
Theorems 3.1 and 3.3. This is because tj,j = 1,... ,p are not independent for 
finite samples. The proof of Theorem 3.4 investigates how the dependence of 
the test statistics diminishes when n,p —)> oo. The proof of Theorem 3.4 al¬ 
ready requires a careful investigation of the convergence of /3 in Theorem 3.3. 
It is more cumbersome to prove FDR control using our test statistics. In Sec¬ 
tion 6 we show that FDR is usually well controlled in simulations for the 
Benjamini-Hochberg procedure when the sample size is large enough. 

Remark 3.2. We find a calibration technique in Sun, Zhang and Owen 
[60] very useful to improve the type I error and FDR control for finite sample 
size. Because the asymptotic variance used in (3.13) is the variance of an 
oracle OLS estimator, when the sample size is not sufficiently large, the vari¬ 
ance of should be slightly larger than this oracle variance. To correct 
for this inflation, one can use median absolute deviation (MAD) with cus¬ 
tomary scaling to match the standard deviation for a Gaussian distribution 
to estimate the empirical standard error of tj,j = 1,... ,p and divide tj by 
the estimated standard error. The performance of this empirical calibration 
is studied in the simulations in Section 6.1. 

3.3.2. Test of confounding. We also consider a significance test for Hq ^ ■ 
Q: = 0, under which the latent factors are not confounding. 

Theorem 3.5. Let the assumptions of Theorem 3.1 or Theorem 3.3 and 
jCj —)> oo for the NC scenario be given. Under the null hypothesis that ci = 0, 
for a = in (3.5) or a = in (3.9), we have 

'^T '' d 2 

n • a. cx ^ Xr 

where Xr is the chi-square distribution with r degree of freedom. 

Therefore, the null hypothesis Flo,a : Q = 0 is rejected if n ■ o^a > 
Xra where Xra is the upper-a quantile of Xr- This test, combined with 
exploratory factor analysis, can be used as a diagnosis tool for practitioners 
to check whether the data gathering process has any confounding factors 
that can bias the multiple hypothesis testing. 

4. Extension to Mnltiple Regression. In Sections 2 and 3 we as¬ 
sume that there is only one primary variable X and all the random variables 
X, Y and Z have mean 0. In practice, there may be several predictors, or 
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we may want to include an intercept term in the regression model. Here we 
develop a multiple regression extension to the original model (2.1). 

Suppose we observe in total d = + di random predictors that can be 

separated into two groups: 

1. Xo : n X do nuisance covariates that we would like to include in the 
regression model, and 

2. Xi : n X di primary variables whose effects we want to study. 

For example, the intercept term can be included in Xq as a n x 1 vector of 
1 (i.e. a random variable with mean 1 and variance 0). 

Leek and Storey [39] consider the case do = 0 and di > 1 for SVA and 
Sun, Zhang and Owen [60] consider the case do >0 and di = 1 for LEAPP. 
Here we study the confounder adjusted multiple regression in full generality, 
for any do > 0 and di > 1. Our model is 


(4.1a) 

(4.1b) 

(4.1c) 

(4.1d) 


Y = XoB^ + XiBj + ZT^ + E, 


are i.i.d. with E 


Xoi 


Xo^ 

Xu 


Z \ (Xo, Xi) ~ MN(XoA^ + XiAf, /„, /,), 
E X (Xo, Xi,Z), E^ MN(0, E, S). 




and 


The model does not specify means for Xq* and Xu] we do not need them. 
The parameters in this model are, for i = 0 or 1, Bj G B G 

Yx £ R'^^'^, and A* G The parameters A and B are the matrix 

versions of a and /3 in model (2.1). Additionally, we assume Yx is invertible. 
To clarify our purpose, we are primarily interested in estimating and testing 
for the significance of Bi. 

For the multiple regression model (4.1), we again consider the rotation 
matrix that is given by the QR decomposition (Xq Xi) = QU where 
Q G R”^"" is an orthogonal matrix and U is an upper triangular matrix of 
size n X d. Therefore we have 

fUoo Uoi\ 

(Xo Xi) = = 0 ?7ii 

Vo 0 / 

where Uoo is a do x do upper triangular matrix and Uu is a di x di upper 
triangular matrix. Now let the rotated Y be 

(4.2) Y = Q^Y = ? j 
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where Yq is do x P, Yi is di x p and Y^_i is {n — d) x p, then we can partition 
the model into three parts: conditional on both Xq and Xi (hence U), 

(4.3) Yo = i7ooB^ + UoiBj + ZqT^ + Eq, 

(4.4) Yi = UuBJ + ZiT^ + ~ MN(i7ii(Bi + TAi)'^, + E) 

(4.5) Yhi = + E_i ~ MN(0, In-d, TY^ + S) 

where Z = Q^Z and E = Q^E = E. Equation (4.3) corresponds to the 
nuisance parameters Bq and is discarded according to the ancillary principle. 
Equation (4.4) is the multivariate extension to (2.4) that is used to estimate 
Bi and equation (4.5) plays the same role as (2.5) to estimate Y and S. 

We consider the asymptotics when n,p —)• 00 and d, r are fixed and known. 
Since d is fixed, the estimation of Y is not different from the simple regression 
case and we can use the maximum likelihood factor analysis described in 
Section 3.1. Under Assumptions 1 to 3, the precision results of Y and S 
(Lemma A.l) still hold. 

Let = Li = (^°° C'i’O' proof of Theorems 3.1 and 3.3, we 

consider a fixed sequence of X such that ||X|| 2 /\/n —)• 1. Similarly, we have 
the following lemma in the multiple regression scenario: 

Lemma 4.1. As n 00 , 

Similar to (3.2), we can rewrite (4.4) as 

Y^U{^ = Bi + Y(Ai + WiU^^) + EiU^^ 

where Wi ~ MN{0, Ip) is independent from Ei. As in Section 3.2, we 
derive statistical properties of the estimate of Bi for a fixed sequence of 
A, lUi and Z, which also hold unconditionally. Eor simplicity, we assume 
that the negative controls are a known set of variables C with Bi^c = 0. 
We can then estimate each column of Ai by applying the negative control 
(NC) or robust regression (RR) we discussed in Sections 3.2.1 and 3.2.2 to 
the corresponding row of YiU^^, and then estimate Bi by 

Bi = uf - YAi. 

Notice that EiU^^ ~ MN(0, Yl, . Thus the “samples” in the ro¬ 

bust regression, which are actually the p variables in the original problem 
are still independent within each column. Though the estimates of each 
column of Ai may be correlated, we will show that the correlation won’t 
affect inference on Bi. As a result, we still get asymptotic results similar to 
Theorem 3.3 for the multiple regression model (4.1): 
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Theorem 4.1. Under Assumptions 1 to 6, if n,p — >• cx), with p/n^ — )• 0 
for some k > 0, and ||vec(Bi)||i\/n/p —)■ 0, then for any fixed index set S 
with finite cardinality |5|, 

(4.6) ^/^(B^|5-Bi,s)4MN(0|5|xfci,5]5 + A5,nii+AfAi), and 

(4.7) V^(B^5 - Bi,s) 4 MN(0|5|xfc,, ^5, fill + Af Ai) 
where A 5 is defined in Theorem 3.1. 


As for the asymptotic efficiency of this estimator, we again compare it 

to the oracle OLS estimator of Bi which observes confounding variables Z 

'' RR, 

in (4.1). In the multiple regression model, we claim that B^ still reaches 

the oracle asymptotic efficiency. In fact, let B = (Bq Bi T). The oracle 
^ OLS 

OLS estimator of B, B , is unbiased and its vectorization has variance 
n where 


V = 


ASx 


SxA'^ 

Ir + AEx A^ 



Ai). 


By the block-wise matrix inversion formula, the top left dx d block of V ^ 
is + A A. The variance of B^ only depends on the bottom right 
di X di sub-block of this dxd block, which is simply flu -|-A^Ai. Therefore 
B^ is unbiased and its vectorization has variance (fin -|- A^Ai) 0 51/n, 
matching the asymptotic variance of B^ in Theorem 4.1. 


5. Discussion. 


5.1. Confounding vs. unconfounding. The issue of multiple testing de¬ 
pendence arises because Z in the true model (1.1) is unobserved. We have 
focused on the case where Z is confounded with the primary variable. Some 
similar results were obtained earlier for the unconfounded case, correspond¬ 
ing to Q: = 0 in our notation. For example, Lan and Du [36] used a factor 
model to improve the efficiency of significance tests of the regression inter¬ 
cepts. Jin [32], Li and Zhong [41] developed more powerful procedures for 
testing (3 while still controlling FDR under unconfounded dependence. 

In another related work, Fan, Han and Gu [21] imposed a factor structure 
on the unconfounded test statistics, whereas this paper and the articles 
discussed later in Section 5.3 assume a factor structure on the raw data. Fan, 
Han and Gu [21] used an approximate factor model to accurately estimate 
the false discovery proportion. Their correction procedure also includes a 
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step of robust regression. Nevertheless, it is often difficult to interpret the 
factor structure of the test statistics. In comparison, the latent variables Z 
in our model (2.1), whether confounding or not, can be interpreted as batch 
effects, laboratory conditions, or other systematic bias. Such problems are 
widely observed in genetics studies (see e.g. the review article [40]). 

As a final remark, some of the models and methods developed in the 
context of unconfounded hypothesis testing may be useful for confounded 
problems as well. For example, the relationship between Z and X. needs not 
be linear as in (1.2). In certain applications, it may be more appropriate to 
use a time-series model [59] or a mixture model [20]. 

5.2. Marginal effects vs. direct effects. In Section 1, we switched our 
interest from the marginal effects r in (1.3) to the direct effects f3. We 
believe that they are usually more scientifically meaningful and interpretable 
than the marginal effects. For instance, if the treated (control) samples are 
analyzed by machine A (machine B), and the machine A outputs higher 
values than B, we certainly do not want to include the effects of this machine 
to machine variation on the outcome measurements. 

When model (2.1) is interpreted as a “structural equations model” [12], (3 
is indeed the causal effect of X on [47]. In this paper we do not make such 
structural assumptions about the data generating process. Instead, we use 
(2.1) to describe the screening procedure commonly applied in high through¬ 
put data analysis. The model (2.1) also describes how we think the marginal 
effects can be confounded and hence different from the more meaningful di¬ 
rect effects (3. Additionally, the asymptotic setting in this paper is quite 
different from that in the traditional structural equations model. 

5.3. Comparison with existing confounder adjustment methods. We dis¬ 
cuss in more detail how previous methods of confounder adjustment, namely 
SVA [38, 39], RUV-4 [26, 27] and LEAPP [60], fit in the framework (2.1). 
See Perry and Pillai [48] for an alternative approach of bilinear regression 
with latent factors that is also motivated by high-throughput data analysis. 

5.3.1. SVA. There are two versions of SVA: the reduced subset SVA 
(subset-SVA) of Leek and Storey [38] and the iteratively reweighted SVA 
(IRW-SVA) of Leek and Storey [39] . Both of them can be interpreted as the 
two-step statistical procedure in the framework (2.1). In the first step, SVA 
estimates the confounding factors by applying PC A to the residual matrix 
{I-Hx)Y where Hx = X{X^X)-^X^ is the projection matrix of X. In 
contrast, we applied factor analysis to the rotated residual matrix (Q^V)_i, 
where Q comes from the QR decomposition of X in Section 4. To see why 
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these two approaches lead to the same estimate of F, we introduce the block 
form of Q = (Qi Q 2 ) where Qi G and Q 2 G It is easy to 

show that {Q'^Y)^i = Q^Y and {I — Hx)Y = Q 2 Q 2 Y. Thus our rotated 
matrix decorrelates the residual matrix by left-multiplying by Q 2 

(because Q 2 Q 2 = In-d)- Because {QIyYqIY = {Q2Q^YfQ2Q^Y, 
(g^Y)_i and (/ - Hx)Y have the same sample covariance matrix, they 
will yield the same factor loading estimate under PCA and also under MLE. 
The main advantage of using the rotated matrix is theoretical: the rotated 
residual matrices have independent rows. 

Because SVA doesn’t assume an explicit relationship between the primary 
variable X and the confounders Z, it cannot use the regression (3.2) to 
estimate a (not even defined) and (3. Instead, the two SVA algorithms try 
to reconstruct the surrogate variables, which are essentially the confounders 
Z in our framework. Assuming the true primary effect (3 is sparse, the 
subset-SVA algorithm finds the outcome variables Y that have the smallest 
marginal correlation with X and uses their principal scores as Z. Then, 
it computes the p-values by F-tests comparing the linear regression models 
with and without Z. This procedure can easily fail because a small marginal 
correlation does not imply no real effect of X due to the confounding factors. 
For example, most of the marginal effects in the gender study in Figure lb 
are very small, but after confounding adjustment we find some are indeed 
significant (see Section 6.2). 

The IRW-SVA algorithm modifies subset-SVA by iteratively choosing the 
subset.At each step, IRW-SVA gives a weight to each outcome variable based 
on how likely fij = 0 the current estimate of surrogate variables. The weights 
are then used in a weighted PCA algorithm to update the estimated surro¬ 
gate variables. IRW-SVA may be related to our robust regression estimator 
in Section 3.2.2 in the sense that an M-estimator is commonly solved by It¬ 
eratively Reweighted Least Squares (IRLS) and the weights also represents 
how likely the data point is an outlier. However, unlike IRLS, the iteratively 
reweighted PCA algorithm is not even guaranteed to converge. Some pre¬ 
vious articles [26, 60] and our experiments in Section 6.1 and Supplement 
[64] show that SVA is outperformed by the NC and RR estimators in most 
confounded examples. 

5.3.2. RUV. Gagnon-Bartsch, Jacob and Speed [26] derived the RUV-4 
estimator of (3 via a sequence of heuristic calculations. In Section 3.2.1, we 
derived an analytically more tractable estimator which is actually the 
same as RUV-4, with the only difference being that we use MLF instead of 
PCA to estimate the factors and GLS instead of OLS in (3.5). To see why 
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is essentially the same as in the first step of RUV-4 it uses the 

residual matrix to estimate F and Z, which yields the same estimate as using 
the rotated matrix (Section 5.3.1). In the second step, RUV-4 estimates /3 via 
a regression of V on X and Z = Q {Z^^ ck^) . This is equivalent to using 
ordinary least squares (OLS) to estimate a in (3.4). Based on more heuristic 
calculations, the authors claim that the RUV-4 estimator has approximately 
the oracle variance. We rigorously prove this statement in Theorem 3.1 when 
the number of negative controls is large and give a finite sample correction 
when the negative controls are few. In Section 6.1 we show this correction 
is very useful to control the type I error and FDR in simulations. 

5.3.3. LEAPP. We follow the two-step procedure and robust regression 
framework in LEAPP [60] in this paper, thus the test statistics are 
very similar to the test statistics in LEAPP. The difference is that LEAPP 
uses the 0-IPOD algorithm of She and Owen [55] for outlier detection, 
which is robust against outliers at leverage points but is not easy to analyze. 
Indeed Sun, Zhang and Owen [60] replaced it by the Dantzig selector in 
its theoretical appendix. The classical M-estimator, although not robust to 
leverage points [65] , allows us to study the theoretical properties more easily. 
In practice, LEAPP and RR estimator usually produce very similar results; 
see Section 6.1 for a numerical comparison. 

5.4. Inference when S is nondiagonal. Our analysis is based on the as¬ 
sumption that the noise covariance matrix 5] is diagonal, though in many 
applications, the researcher might suspect that the outcome variables Y 
in model (2.1) are still correlated after conditioning on the latent factors. 
Typical examples include gene regulatory networks [17] and cross-sectional 
panel data [49], where the variable dependence sometimes cannot be fully 
explained by the latent factors or may simply require too many of them. Bai 
and Li [6] extend the theoretical results in Bai and Li [3] to approximate fac¬ 
tor models allowing for weakly correlated noise. Approximate factor models 
have also been discussed in Pan and Han [22]. 

6. Numerical Experiments. 

6.1. Simulations. We have provided theoretical guarantees of confounder 
adjusting methods in various settings and the asymptotic regime of n,p —>■ oo 
(e.g. Theorems 3.1 to 3.4 and 4.1). Now we use numerical simulations to 
verify these results and further study the hnite sample properties of our 
estimators and tests statistics. 
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The simulation data are generated from the single primary variable model 

(2.1) . More specifically, Xi is a centered binary variable (Xj + l)/2 
Bernoulli(0.5), and Y), Zi are generated according to (2.1). 

For the parameters in the model, the noise variances are generated by 
0-2 invGamma(3,2), j = and so IE(a'|) = Var((T|) = 1. We 

set each = ||Q:|| 2 /-v/r equally for A; = 1, 2, • • • , r where ||q:||| is set to 1, 
so the variance of Xi explained by the confounding factors is = 50%. 
(Additional results for = 5% and 0 are in the Supplement.) The primary 
effect (3 has independent components f3i taking the values Sy^l + ||q :||2 and 
0 with probability vr = 0.05 and 1 — vr = 0.95, respectively, so the nonzero 
effects are sparse and have effect size 3. This implies that the oracle estimator 
has power approximately P(N(3,1) > ^^ 0 . 025 ) = 0.85 to detect the signals at 
a significance level of 0.05. We set the number of latent factors r to be either 

2 or 10. For the latent factor loading matrix F, we take F = FZ) where F 
is a p X r orthogonal matrix sampled uniformly from the Stiefel manifold 
14 ( 1 ^^)) the set of all p x r orthogonal matrix. Based on Assumption 3, 
we set the latent factor strength D = ^yp • diag(di,-- - ,dr) where = 

3 — 2{k — l)/(r — 1) thus di to dr are distributed evenly inside the interval 

[3.1] . As the number of factors r can be easily estimated for this strong 
factor setting (more discussions can be found in Owen and Wang [46]), we 
assume that the number r of factors is known to all of the algorithms in this 
simulation. 

We set p = 5000, n = 100 or 500 to mimic the data size of many ge¬ 
netic studies. For the negative control scenario, we choose \C\ =30 negative 
controls at random from the zero positions of /3. We expect that negative 
control methods would perform better with a larger value of |C| and worse 
with a smaller value. The choice \C\ = 30 is around the size of the spike-in 
controls in many microarray experiments [27]. For the loss function in our 
sparsity scenario, we use Tukey’s bisquare which is optimized via IRLS with 
an ordinary least-square fit as the starting values of the coefficients. Finally, 
each of the four combinations of n and r is randomly repeated 100 times. 

We compare the performance of nine different approaches. There are two 
baseline methods: the “naive” method estimates /3 by a linear regression of 
Y on just the observed primary variable X and calculates p-values using the 
classical t-tests, while the “oracle” method regresses Y on both X and the 
confounding variables Z as described in Section 2.1. There are three methods 
in the RUV-4/negative controls family: the RUV-4 method [26], our “NC” 
method which computes test statistics using and its variance estimate 
(1 + + ^)) and our “NC-ASY” method which uses the same 0^^ 

but estimates its variance by (1 -|-||q:||2 )^. We compare four methods in the 
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SVA/LEAPP/sparsity family: these are “IRW-SVA” [39], “LEAPP” [60], 
the “LEAPP(RR)” method which is our RR estimator using M-estimation 
at the robustness stage and computes the test-statistics using (3.13), and 
the “LEAPP(RR-MAD)” method which uses the median absolute deviation 
(MAD) of the test statistics in (3.13) to calibrate them, (see Section 3.3) 

To measure the performance of these methods, we report the type I error 
(Theorem 3.4), power, false discovery proportion (FDP) and precision of 
hypotheses with the smallest 100 p-values in the 100 simulations. Eor both 
the type I error and power, we set the significance level to be 0.05. Eor FDP, 
we use Benjamini-Hochberg procedure with FDR controlled at 0.2. These 
metrics are plotted in Figure 2 under different settings of n and r. 

First, from Figure 2, we see that the oracle method has exactly the same 
type I error and FDP as specified, while the naive method and SVA fail 
drastically. SVA performs performs better than the naive method in terms of 
the precision of the smallest 100 p-values, but is still much worse than other 
methods. Next, for the negative control scenario, as we only have |C| = 30 
negative controls, ignoring the inflated variance term A 5 in Theorem 3.1 
will lead to overdispersed test statistics, and that’s why the type I error and 
FDP of both NC-ASY and RUV-4 are much larger than the nominal level. 
By contrast, the NC method correctly controls type I error and FDP by 
considering the variance inflation, though as expected it loses some power 
compared with the oracle. For the sparsity scenario, the “LEAPP(RR)” 
method performs as the asymptotic theory predicted when n = 500, while 
when n = 100 the p-values seem a bit too small. This is not surprising 
because the asymptotic oracle variance in Theorem 3.3 can be optimistic 
when the sample size is not sufficiently large, as we discussed in Remark 3.2. 
On the other hand, the methods which use empirical calibration for the 
variance of test statistics, namely the original LEAPP and “LEAPP(RR- 
MAD)”, control both FDP and type I error for data of small sample size in 
our simulations. The price for the hnite sample calibration is that it tends 
to be slightly conservative, resulting in a loss of power to some extent. 

In conclusion, the simulation results are consistent with our theoretical 
guarantees when p is as large as 5000 and n is as large as 500. When n is 
small, the variance of the test statistics will be larger than the asymptotic 
variance for the sparsity scenario and we can use empirical calibrations (such 
as MAD) to adjust for the difference. 

6.2. Real data examples. In this section, we return to the three motivat¬ 
ing real data examples in Section I. The main goal here is to demonstrate a 
practical procedure for confounder adjustment and show that our asymptotic 
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BType I error EPower |*FDP |ElTop100 

Fig 2: Compare the performance of nine different approaches (from left to right): 
naive regression ignoring the confounders (Naive), IRW-SVA, negative control with 
finite sample correction (NC) in (3.7), negative control with asymptotic oracle vari¬ 
ance (NC-ASY) in (3.8), RUV-4, robust regression (LEAPP(RR)), robust regression 
with calibration (LEAPP(RR-MAD)), LEAPP, oracle regression which observes the 
confounders (Oracle). The error bars are one standard deviation over 100 repeated 
simulations. The three dashed horizontal lines from bottom to top are the nominal 
significance level, FDR level and oracle power, respectively. 
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results are reasonably accurate in real data. In an open-source R package 
cate (available on CRAN), we also provide the necessary tools to carry out 
the procedure. 

6.2.1. The datasets. First we briefly describe the three datasets. The 
first dataset [56] is tries to identify candidate genes associated with the 
extent of emphysema and can be downloaded from the GEO database (Series 
GSE22148). We preprocessed the data using the standard Robust Multi¬ 
array Average (RMA) approach [31]. The primary variable of interest is the 
severity (moderate or severe) of the Ghronic Obstructive Pulmonary Disease 
(GOPD). The dataset also include age, gender, batch and date of the 143 
sampled patients which are served as nuisance covariates. 

The second and third datasets are taken from Gagnon-Bartsch, Jacob and 
Speed [26] where they used them to compare RUV methods with other meth¬ 
ods such as SVA and LEAPP. The original scientific studies are Vawter et al. 
[62] and Blalock et al. [11], respectively. The primary variable of interest is 
gender in both datasets, though the original objective in Blalock et al. [11] 
is to identify genes associated with Alzheimer’s disease. Gagnon-Bartsch, 
Jacob and Speed [26] switch the primary variable to gender in order to have 
a gold standard: the differentially expressed genes should mostly come from 
or relate to the X or Y chromosome. We follow their suggestion and use 
this standard to study the performance of our RR estimator. In addition, as 
the first GOPD dataset also contains gender information of the samples, we 
apply this suggestion and use gender as the primary variable for the GOPD 
data as a supplementary dataset. 

Pinally, we want to mention that the second dataset has repeated sam¬ 
ples from the same individuals while the individual information is lost. We 
suspect that the individual information are then strong latent factors which 
caused the atypical concentration of the histograms in Figure lb and Fig¬ 
ure Id. This suggests necessity of a latent factor model for this dataset. 

6.2.2. Confounder adjustment. Recall that without the confounder ad¬ 
justment, the distribution of the regression f-statistics in these datasets can 
be skewed, noncentered, underdispersed, or overdispersed as shown in Fig¬ 
ure 1. The adjustment method used here is the maximum likelihood factor 
analysis described in Section 3.1 followed by the robust regression (RR) 
method with Tukey’s bisquare loss described in Section 3.2.2. Since the true 
number of confounders is unknown, we increase r from 1 to n/2 and study 
the empirical performance. We report the results without empirical calibra¬ 
tion for illustrative purposes, though in practice we suggest using calibration 
for better control of type I errors and FDP. 


CONFOUNDER ADJUSTMENT 


29 


r 

mean 

median 

sd 

mad 

skewness 

medc. 

#sig. 

p-value 

0 

-0.16 

0.024 

2.65 

2.57 

-0.104 

-0.091 

164 

NA 

1 

-0.45 

-0.39 

2.85 

2.52 

-0.25 

0.00074 

1162 

0.0057 

2 

0.012 

-0.039 

1.35 

1.33 

0.139 

0.042 

542 

<le-10 

3 

0.014 

-0.05 

1.43 

1.41 

0.169 

0.048 

552 

<le-10 

5 

-0.029 

-0.11 

1.52 

1.48 

0.236 

0.057 

647 

<le-10 

7 

-0.1 

-0.14 

1.42 

1.35 

0.109 

0.027 

837 

<le-10 

10 

-0.06 

-0.085 

1.13 

1.12 

0.103 

0.022 

506 

<le-10 

20 

-0.083 

-0.095 

1.2 

1.19 

0.0604 

0.0095 

479 

<le-10 

33 

- 0.099 

- 0.11 

1.33 

1.3 

0.0727 

0.0056 

579 

< le -10 

40 

-0.1 

-0.12 

1.43 

1.4 

0.0775 

0.0072 

585 

<le-10 

50 

-0.16 

-0.17 

1.58 

1.53 

0.0528 

0.0032 

678 

<le-10 


(a) Dataset 1 (n = 143, p = 54675). Primary variable: severity of COPD. 


r 

mean 

median 

sd 

mad 

skewness 

medc. 

#sig. 

X/Y top 100 

p-value 

0 

0.11 

0.043 

0.36 

0.237 

2.99 

0.2 

1036 

58 

11 

NA 

1 

-0.44 

-0.47 

1.06 

1.04 

0.688 

0.035 

108 

20 

20 

0.74 

2 

-0.14 

-0.15 

1.15 

1.13 

0.601 

0.015 

113 

21 

21 

0.31 

3 

0.013 

0.012 

1.13 

1.08 

0.795 

-0.01 

168 

34 

28 

0.03 

5 

0.044 

0.019 

1.18 

1.08 

0.878 

0.017 

238 

32 

27 

0.0083 

7 

0.03 

0.012 

1.26 

1.15 

0.784 

0.0062 

269 

35 

25 

0.006 

10 

0.023 

0.00066 

1.36 

1.24 

0.661 

0.011 

270 

38 

27 

0.019 

15 

0.049 

0.022 

1.46 

1.31 

0.584 

0.012 

296 

36 

29 

0.00082 

20 

0.029 

-0.0009 

1.53 

1.36 

0.502 

0.019 

314 

36 

28 

7.2e-07 

25 

0.048 

0.012 

1.68 

1.48 

0.452 

0.026 

354 

37 

27 

l . le -06 

30 

0.026 

0.012 

1.82 

1.61 

0.436 

0.0068 

337 

40 

27 

8.7e-08 

40 

0.061 

0.046 

2.07 

1.79 

0.642 

0.0028 

363 

41 

27 

7.7e-10 


(b) Dataset 2 (n = 84, p = 12600). Primary variable: gender. 


r 

mean 

median 

sd 

mad 

skewness 

medc. 

#sig. 

X/Y top 100 

p-value 

0 

-1.8 

-1.8 

0.599 

0.513 

-3.46 

0.082 

418 

39 

20 

NA 

1 

-0.55 

-0.56 

1.09 

1.01 

-1.53 

0.01 

261 

29 

23 

0.00024 

2 

-0.2 

-0.22 

1.2 

1.11 

-0.99 

0.014 

320 

38 

22 

0.00014 

3 

-0.096 

-0.12 

1.27 

1.18 

-0.844 

0.017 

311 

42 

25 

0.00014 

5 

-0.33 

-0.32 

1.31 

1.22 

-1.29 

-0.011 

305 

35 

23 

2.1e-07 

7 

-0.37 

-0.36 

1.46 

1.36 

-0.855 

-0.0099 

300 

38 

23 

4.0e-07 

11 

- 0.13 

- 0.12 

1.51 

1.36 

- 0.601 

- 0.0051 

432 

48 

31 

1 . 8 e -09 

15 

-0.12 

-0.13 

1.83 

1.62 

-0.341 

0.013 

492 

54 

25 

2.3e-08 

20 

-0.13 

-0.14 

2.61 

2.23 

-0.327 

0.0045 

613 

50 

26 

4.0e-06 


(c) Dataset 3 (n = 31, p = 22283). Primary variable: gender. 

Table 2 

Summary of the adjusted z-statistics. The first group is summary statistics of the 
z-statistics before the empirical calibration. The second group is some performance 
metrics after the empirical calibration, including total number of significant genes 
of p-value less than 0.01 in Remark 3.2 (#sig.), number of the genes on X/Y 
chromosome that have p-value less than 0.01 (X/Y), the number among the 100 
most significant genes that are on the X/Y chromosome (top 100) and the p-value 
of the confounding test in Section 3.3.2. The bold row corresponds to the r selected 

by BCV (Figure 3). 
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r t-statistics 


(a) Dataset 1: BCV selects r = 33. 


(b) Dataset 1: histogram. 




(c) Dataset 2: BCV selects r = 25. (d) Dataset 2: histogram. 




r t-statistics 


(e) Dataset 3: BCV selects r = 11. (f) Dataset 3: histogram. 

Fig 3: Histograms of z-statistics after confounder adjustment (without calibration) 
using the number of confounders r selected by bi-cross-validation. 
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In Table 2 and Figure 3, we present the results after confounder adjust¬ 
ment for the three datasets. We report two groups of summary statistics in 
Table 2: the first group is several summary statistics of all the z-statistics 
computed using (3.13), including the mean, median, standard deviation, 
median absolute deviation (scaled for consistency of normal distribution), 
skewness, and the medcouple. The medcouple [13]) is a robust measure of 
skewness. After subtracting the median observation some positive and some 
negative values remain. For any pair of values xi > 0 and X2 < 0 with 
xi + \x 2 \ > 0 one can compute (xi — |x 2 |)/(ti -|- |x 2|). The medcouple is the 
median of all those ratios. The second group of statistics has performance 
metrics to evaluate the effectiveness of the confounder adjustment. See the 
caption of Table 2 for more detail. 

In all three datasets, the z-statistics become more centered at 0 and less 
skewed as we include a few confounders in the model. Though the stan¬ 
dard deviation (SD) suggests overdispersed variance, the overdispersion will 
go away if we add MAD calibration as SD and MAD have similar values. 
The similarity between SD and MAD values also indicates that the ma¬ 
jority of statistics after confounder adjustment are approximately normally 
distributed. Note that the medcouple values shrink towards zero after ad¬ 
justment, suggesting that skewness then only arises from small fraction of 
the genes, which is in accordance with our assumptions that the primary 
effects should be sparse. 

In practice, some latent factors may be too weak to meet Assumption 3 
(i.e. dj <C , making it difficult to choose an appropriate r. A practical 
way to pick the number of confounders r with presence of heteroscedastic 
noise we investigate here is the bi-cross-validation (BCV) method of Owen 
and Wang [46], which uses randomly held-out submatrices to estimate the 
mean squared error of reconstructing factor loading matrix. It is shown 
in Owen and Wang [46] that BCV outperforms many existing methods in 
recovering the latent signal matrix and the number of factors r, especially 
in high-dimensional datasets (n,p —)• oo). In Figure 3, we demonstrate the 
performance of BCV on these three datasets. The r selected by BCV is 
respectively 33, 25 and 11 (Figures 3a, 3c and 3e), and they all result in 
the presumed shape of z-statistics distribution (Figures 3b, 3d and 3f). For 
the second and the third datasets where we have a gold standard, the r 
selected by BCV has near optimal performance in selecting genes on the 
X/Y chromosome (columns 3 and 4 in Tables 2b and 2c). Another method 
we applied is proposed by Onatski [44] based on the empirical distribution of 
eigenvalues. This method estimates r as 2, 9 and 3 respectively for the three 
datasets. Table 3 of Gagnon-Bartsch, Jacob and Speed [26] has the “top 
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100” values for RUV-4 on the second and third dataset. They reported 26 
for LEAPT, 28 for RUV-4, and 27 for SVA in the second dataset, and 27 for 
LEAPP, 31 for RUV-4, and 26 for SVA in the third dataset. Notice that the 
precision of the top 100 significant genes is relatively stable when r is above 
certain number. Intuitively, the factor analysis is applied to the residuals 
of V on JV and the overestimated factors also have very small eigenvalues, 
thus they usually do not change /3 a lot. See also Gagnon-Bartsch, Jacob 
and Speed [26] for more discussion on the robustness of the negative control 
estimator to overestimating r. 

Lastly we want to point out that both the small sample size of the datasets 
and presence of weak factors can result in overdispersed variance of the test 
statistics. The BCV plots indicate presence of many weak factors in the 
first two datasets. In the third dataset, the sample size n is only 31, so the 
adjustment result is not ideal. Nevertheless, the empirical performance (e.g. 
number of X/Y genes in top 100) suggests it is still benehcial to adjust for 
the confounders. 


APPENDIX A: PROOES 

A.l. More technical results of factor analysis. Here we prove uni¬ 
form convergence of the estimated factors and noise variances based on the 
results of Bai and Li [3], which are needed to prove Theorems 3.1 to 3.4. In 
the proof of the following lemma, we intensively use some of the technical 
results in Bai and Li [3] and also modify internal parts of their proof. Before 
reading the proof of Lemma A.l, we recommend that the reader first read 
the original proof in Bai and Li [3, 4]. To help the readers to follow, the 
variables N, T, A (or A*) and / (or /*) in Bai and Li [3] correspond to p, 
n, r(0) and in our notation. 

Lemma A.l. Under Assumptions 1 to 3, for any fixed index set S with 
finite cardinality, 

(A.l) 

where EI5 is the noise covariance matrix of the variables in S. Further, if 

there exists k > 0 such that p/n^ ^ 0 when p —)■ 00, then 

(A.2) 

max jcr^ — a‘^\ = Op(y/logp/n), max |f,■ — = Op{\/logp/n), and 

l<j<P 



n — 1 

i=2 





(A.3) 


max 
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Remark A.l. If we directly apply the results in Bai and Li [3] to prove 
Lemma 3.1, we need uniform boundedness of which is not always true. 
However, it is easy to show R Ir by applying Bai and Li [3, Lemma 
A.l]. Also, as RRF is the sample covariance matrix, the maximum entry of 
|i? — I\ is Op(n“^/^), thus the maximum entry of — F| = |F(i? — F)| 
is also As a consequence, although F^*^) is not always uniformly 

bounded, all the results in Bai and Li [3] still hold as we stated in Lemma 3.1 
and Lemma A.l. 

Proof. Our factor model corresponds to the IC3 identihcation condition 
in Bai and Li [3]. Equation (A.l) is an immediate consequence of Bai and 
Li [3, Theorem 5.2], except here we additionally consider the asymptotic 
covariance of ^/n{Tj — and -y/n(f — F^°^). The asymptotic distribution 
of -y/n(f 5 — F®) immediately follows from equation (F.l) in Bai and Li [4]: 


^ n 

(A.4) - Ff ) = —= ^ Zf ^4- + Op(l). 

Now we prove (A.2). Let Fj — F^-°^ = hij + h 2 j + ■ ■ ■ + bio,j where 
represents the /cth term in the right hand side of equation (A. 14) in Bai and 
Li [3]. Also, let — (t| = aij + a 2 j + • • • + oioj where akj represents the kth 
term in the right hand side of equation (B.9) in Bai and Li [4]. To bound 
each bj and aj term, we extensively use Lemma C.l of Bai and Li [4]. First, 
we give a clearer approximation to replace (a) and (c) in Lemma C.l of Bai 
and Li [4]: 

(A.5) ||iTf^S-^(f - F(°))||i. = Opin-^) + Op{n-^l‘^v~^l‘^) 

and 

(A.6) = Op(n-i/V^''") + Op(n“^) 

n — 1 " wr t' 

where M = and || • \\f is the Frobenius norm. To show (A.6), 

one just needs to apply Hp = pH = Op{l) [3, Corollary A.l], Remark A.l 
and (n — 1)~^Z^l = Ir to simplify Lemma C.l(e) of Bai and Li 
[4]. To prove (A.5), notice that under our conditions (or the IC3 condi¬ 
tion of Bai and Li [3]), the left hand side of (A. 13) in Bai and Li [3] 
is actually 0 as the terms Mff and MJj- in their notation are exactly 
Ir- Also, iTF = Ir + Op{l) from Bai and Li [3, Corollary A.l]. 
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Thus, (A.5) holds by applying Lemma C.l of Bai and Li [4]. As a conse¬ 
quence, by applying Bai and Li [4, Lemma C.l], (A.5) and (A.6), we now 
have maxj \b]^j\ = Op(n“^/^) for A: 7^ 8,10 and maxj \akj\ = for 

A: / 1, 2, 8, 9,10. Using independence of the noise, it’s also easy to see that 
maxj IbgjI = Op{\/\ogp/n) and maxj \akj\ = Op(y^logp/n) for A: = 1,10. 

Next, we show the following facts under the condition that p/ri^ —)• 0 when 
p —)• 00 for some k > 0. Let ieti)[n-i}xp = denote a random matrix 

whose entries are then i.i.d. N(0,1) variables. Then for each s = 1, 2, • • • , r. 


n—1 


(A.7) 


max - -^ 

j=i, 2 ,-,p [n - l)p 


- E(etjetj)] = Op{n and 


i=l 


t=l 


(A.8) .^max _ ^ - E{etietj)]j =Op{n-^/^). 

To prove (A.7), we only need to show maxj | Yt=i = 

Op(n“^/^) as the remaining term is Op(n“^/^) because of the independence. 
This approximation is proven by the union bound and boundedness of T: 
for Ve > 0 


n—1 


2 _ 

lim P (^/n max ^lEE 


,p^oo \ ,p (n — l)p 


t=l 


< lim 2p ■ P 

n,p—>00 


y/nP 
{n — l)p 


n—1 


EE 


i^l t=l 
n—1 . 

= lim 2p-P( > 

n,p^oo \ n — 1 VP — 1 


i=l 


e p 
D y/p- I 


< lim 2p ■ E 

n,p^oo 


n—1 - 

n / 1 


n 




e P 
D y/p-I 


= 0 


To see why the last equality holds, {p — 1)”^'^^ ~ N(0,1) is in¬ 

dependent from eti, thus the fourth moment of {n — 1)“^/^ YJtZi Pi{{p ~ 
1)“^/^ Yi^i Pi) is bounded which enables us to use the Markov inequality. 
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lim P( max 

i,p^oo \ ,p [n — IYp 


To prove (A. 8), we start with the same union bound as for (A. 7), 

1 n—l 

> e 

t=i 

p n—l 

\2 ^ 

i=2 t=l 
n—l 

Y et2eti > Ve 


p n-l 

< lim P-P(-, - ^'^{'^euetiY > 

n,p^oo \[n — l)^p 

1 


< lim 2p ■ P 

n,p^oo 


n — 


t=l 

n—l 


< lim 2p ■ E 

n,p^oo 


^ IL J. 

(—T 

t=i 


et2eti ) 


iki 


/e 


2k 


< lim 2C/e2^ • (//n^^) = 0 

n,p—>oo ' ' 


where C is some positive constant. The second last inequality is due to 
Markov inequality and last inequality holds as etjCti, t = 1, 2, • • • n — 1 are 
independent and have finite moments of any order. The last limit holds when 
we assume p/rY —)• 0. 

Equation (A.7) directly implies that 


max 
j=ir-- ,p 


«{E 


1 

(7 idj 


1 


n — l 


t=2 


= Op{n 


as H = Op{p ^). Using (A.8) and p 11^; — = Op{n from 

Lemma 3.1, we get by using the Cauchy-Schwartz inequality: 


max 


«{E 


1 

CiCTj 


(r. 


dO)^ 


n — l 


Y,[EuE, 




t=2 


E{EuEtj)]) 


Op{n 


Similarly, combining with Remark A.l, we get 


max 


«(E 


1 

(JiCj 



1 

Ej)^^Y.^EuEt, -EiEuEtj)]) 

^ t=2 


Op{n 


By writing f j = Ej + E^^^ — Ej + f ^ — E® and using boundedness of both 
dj and Uj, 


(A.9) 


max 


«(E 


a: 


Lf-^ 

2 ^n-l 


'^\EtiEtj — E{EtiEtj)\) 

t=2 


= Op{n 
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which indicates that maxj |a9j| = 

To bound the remaining terms, we use the fact that maxj=i^... |rj| = 
Op(l). To see this, first notice that because of boundedness of dj and aj 
and the fact that H = Op{p~^), we have maXj |6ioj| = Op(p~^ maxj |rj|). 
Combining the previous results, we have maxj |f j — T® | = Op(y^log p/n) + 
Op(maxj iTjl) which indicates that maXj iFjl = Op{l). Thus, maxj logjl = 
Op(maxj \^‘j ~ ^‘]\) is negligible and maxj |f^ = Op{^/\ogpJn) + 

Op(maxj l&j — u||). The latter conclusion also indicates that maXj |a2j| = 
Op{^ylogp/n) + Op(maxj |(t| — cj||). As a consequence, the second claim in 
(A.2) holds. 

Finally, To prove (A.3), we actually have already shown that maXj iFj — 
rf - bsj\ = Op(n-V2). Then, 


max 


■p _ p(0) _ 


Y^zf^E, 


n — 1 


i=2 


< max 

,p 


■p p(0) L 

i j - i . - Osi 


+ max 

,p 


1 

i=2 

n 

<Op(n-i/2) + ||iTf^5]-i(f-F(0))||p max - 

j=i,2,-,p n - 1 ^ * 


=Op{n 


Thus, (A.3) holds. 


□ 


A.2. Proof of Theorem 3.1. First, note that by the strong law of 
large numbers n“^/^||X||2 = sjn~^ XILi i-> 

RR^ = (n- “ 4 ' Ir. 

Indeed one can show that R 4 T by applying Bai and Li [3, Lemma 
A.l]. We proceed to prove our theorem by showing the conclusion holds for 
any fixed u and fixed sequences and p=i such that 

II ||2/-v/n —)• 1 and R^^’P'i —)• T as n,p —)• oo. For brevity we will write 

X and R instead of and R^'^’P') for the rest of this proof. 
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Plugging (3.4) in the estimator (3.5) and (3.6), we obtain 

Vn(/3^? - /3_c) =j^{El_c - f _c(f?Sc'rc)-'f 
11^ l|2 

+Vn • (r® - f _c)a(°) 

+v^ • f _c(f 'f’c)”^rc '(re - 

As n,p —)> oo, y^/||JA ||2 1. Also, as p/n^ — )■ 0 for some A: > 0, using 

Lemma A.l and Remark A.l, both 'S and F has entrywise uniform conver¬ 
gence in probability to S and F. Using Assumption 4, we get 

(PUETfc)"=(iLrJi^Trc)" + o.{i) 

(A.io) + e>p(i) 

pf^Sc'(Vn(rc -rf)) =pF?Sci(Vn(fc -F®)) +Op(l) 

which implies 
(A.ll) 

- Ps) =Els - Vs{Vl^0Tc)-^TlT,0Elc 
+Vn • (F® - f 5)a(°) 

+V^ • F5(F?S-iFc)-^Fj5]cnfc - Fg’))a(o) + Op(l). 

Note that Ei X f, Ei^ X Ei^S: \/n(f 5 — F^^) -A N(0, S 5 (8) Ir), the 

four main terms on the right hand side of (A.ll) are (asymptotically) uncor¬ 
related, so we only need to work out their individual variances. Since Ef ~ 
N(0,5]), we have Efg ~ N(0, S5) and Ts{T^'E0Tc)-^T'^'E0EIc ~ 

N(0, A 5 ). Similarly, ^/n ■ (F^°^ — f 5 )q:(°) -a N(0, HqIPS^), and 

^/^ • F5(F^Sc^Fc)-^Fj5]c^(fc - Ff )a(0) 4 N(0, ||af Ag). 


A.3. Proof of Theorem 3.2. As in the proof of Theorem 3.1, we 
prove the conclusions in this theorem for any fixed VFi and fixed sequences 
and such that ||X(^)|| 2 /V^ ^ 1 and R 

as n,p ^ 00. For brevity we will write X and R instead of and R^'^’P') 
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for the rest of this proof. We abbreviate as a in this proof. To avoid 
confusion, we use a for the true value of the parameter and ci; to represent 
a vector in M^. 

Because —>■ a, we prove this theorem by showing that for any e > 0, 

P(||q: — > e) —t 0. We break down our proof to two key results: First, 

we show a and are close in the following sense 

rni 1 f TJ — a)\ 

(A.12) - q) = - ^ p - = Op(l), 

^ j=i \ J 

and second, we show that for sufficiently small e > 0, there exists r > 0 such 
that as n,p —>■ oo 


(A.13) 


P 



ip^di) > r 


^ 1 . 


Based on these two results and the observation that 


Woe 


( 0 ) _ 


a 


I2 < e} D - q) < r| Pi 


l|2>e 


we conclude that P(||q; — > e) —)• 0. 

Let’s start with (A.12). Denote lp{a) = p~^ Z]j=i P . 

By (3.9), we have = argminZp(Q:), so Ipia) < Zp(a(°)). We examine 
the difference between lp{dL) and — a) for any a, starting from 


lp{(^) 


1^ /^yl,/||X||2-fJa^^ 

^-j 

1 ' / ft + (rff Q<°) + Eijl\\x\\2 - tja 

pp,'‘\ V' 


Because p has bounded derivative, |/?(a:) — p{y)\ < D\x — y\ for any x, y G M. 
In the statement of Theorem 3.2 we assume ||/3||i/p —)• 0. This together with 
IIXII2 —)• 0 implies that 


^ i=i 


1 P /(rf))^a(0)-ffa' 


+ Op{l). 


(r(0))T^(0)_fJa f,(a(0)-a) 


(r® - f 



^3 


Next 
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Therefore, by the same argument as before, 

(A.14) = +o,(l) = v^(«(0)-a) + Op(l). 

pU V / 


Also, <^(0) = 0 because /9(0) = 0. Therefore lp{a) < lp(a^^^) = Op(l). Notice 
that the Op(l) term in (A.14) does not depend on a, hence — a) = 

^p(^) T ®p(^) — 

Next we prove (A. 13). Since p{x) is non-decreasing when x > 0, 


inf -Vp 
l|a||2>eP “ 
J=1 


If p/n^ —7- 0 for some k > 0, then using Lemma A.l, there exists some 
constant D* that P(maxj ||rj||2 < D*) —)• 1. Thus when maxj ||rj||2 < D* 
holds, there is sufficiently small e > 0, the a on the right hand side is within 
the neighborhood where p is strongly convex in Assumption 5, so for some 

K > 0 


inf - ^ P 

« 2=eP “ 
J=1 



> 


jnf 

a\\2=e 


K ■ - 

P 


1 ^ 







By the uniform consistency of F and S using Lemma A.l, we conclude 
(A.13) is true for r = Ke^Amin(r^5]~^r)/2, where Amin(r^S~^r) > 0 by 
Assumption 3. 


A.4. Proof of Theorem 3.3. Because is consistent, we can ap¬ 
proximate the left hand side of (3.12) by its second order Taylor expansion 
(we abbreviate ^^ f s i'® i^ ii' causes no confusion): 

0 = ’4'p(Q:i°i) V’4'p(Q!i°i) • (q^^ - ai^i) Vp 

where Vp is the higher order term and Assumption 6 implies Vp = Op(||Q;^^ — 
q:|| 2). Therefore — [VT'p(Q:i°i) -|- Op(l)] ^ Tfp(Q:i‘’i) and 


(A.15) 


^(/jRR _ p) + \/^(r(oi - f )q 


RR 


,1^112 

+ f 


1 -1 


VT'p(a(°))+Op(l) ^/n’^^p(a(°)) 


It’s easy to show {y/n/\\X\\2)Ei^s + “ f5)Q:^^ -A- N(0, (I -|- 

q;|P)5 ]5) by Lemma A.l. Therefore the proof of Theorem 3.3 is completed 
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once we can show the largest eigenvalue of ^ is Op(l) and 

A 0. We prove these two facts in the following lemma: 

Lemma A.2. Under the assumptions and limits in Theorem 3.3, the 
largest eigenvalue of the matrix [V^p(Q:(°^)] is bounded in probability 
and A 0 . 

Proof. By using the representation of F in (A.4), we have 

Wi,/||X||2-fJa(0)~ 




i=i 


(Ji 


) A/A 




lP ,^j + Eij/WXh + 

2^^ 


1A , /Pi + AFIAlb - + i, 


A/A 


p 


i=i 


(T 1 + 5j 


A/A 


where maxj |<5j| = Op(l) and maxj |ej| = Op{n from Lemma A.l. Be¬ 
cause ||/3||i\/™/p —)• 0 and ip' is bounded, 




1 ^ + e, 


p 


i=i 


a j + 5j 


'^tj/aj+Op{n 


Let gj be the expression inside ip in the last equation omitting ej and 5j. 
Conditionally on Z^l , the variables gj , where j = 1 ,..., p are independent 
and identically distributed with E{gj) = 0 and gj = Op{n~^^‘^). Thus, using 
Assumption 6 and boundedness of aj, 


E H 


L-i 


9j + 


^j9j 


(Ti 


-'^(sj) tj/aj 


<D‘^ ■ 


Yj (IC'llAl + IdjW^ji'jl)/^j — Op{n ! ) 

^ i=i 


We can further use the facts that ip{gj) = 'ip'{0)gj+Op{n = Op{n 
and ipisj) — ip'{0)gj are i.i.d., and combine Remark A.l and Lemma A.l to 
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get: 


^p(q:W)|| = (gi+ +Op{n 

I p \ J ^ 

\'^'^i9j)^j/^j +Op{n-^/‘^) 


i=i 


-(0) 


1 ^ r^. 

^ + Op(n-^/ 2 ) 


V 

J=1 




1 p(0) 

+Op(n-^/ 2 ) 

^ i=i ^ 


= Op{n 2) + Op(n 2) = Op(n 2 ) 

Similarly, because limp^oo exists and is positive definite (in 

Assumption 3), we use Assumption 6 and the uniform convergence of S and 
F in Lemma A.l to get 




1 -1 


^ {93 + ) rj-rJ/'^j + op(i) 


.=1 


fj,- 


■3^J 1^3 ^ 

M 


(0)FfFf%| + Op(l) 


.=1 


V^'W-Er^rJ/a,^ 


1 -1 


p 


3=3 


This means that all the eigenvalues of [VT^p(q:^*^^)] ^ converge to finite 
constants. □ 


A.5. Proof of Theorem 3.4. We begin with a lemma regarding the 
test statistics. 

Lemma A. 3. The test statistics (3.13) can be written as tj = zj +Vj for 
j G Mp, where Zj,j G A/), are independent standard normal variables and 
Vj,j G Mp satisfy maxi<j<p |uj| = Op(l). 

Proof. We first prove this lemma for the RR estimator and the 
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corresponding test statistics. Let 


^ crj Y^l + ||q:P 

It is easy to verify that Zj i.i.d. ~ N(0,1), j = 1,... ,p. By using the expres¬ 
sion of in (A. 15), we can show for j ^ M (so /3j = 0), that 


max 

ieWp 


= max 
j&K 


= max 
ieWp 

= Op(l), 


\X\\2^j - cTjy/l + WaW^Zj 
1 


\X\\2^j- 




1 

|X||2(rf - 


+ Vi 


where rj = tj + Op(l)] ^ ||X||2’J'p(a:*^‘^)). The last step is due 

to the uniform convergence of Tj in (A.3) and Lemma A.2 to uniformly 
control rj. Now we can show, by using the uniform convergence rate of 
that 


max \ Vj\ = max 
isA/p jeAfp 


= max 

j&Xp 






iix 112/3,•-d.-yrr 

\\a\\^Zj 


/\/l + 


ct 


=Op ( max ||X||2/3 j — aj\/T+~^&^Zj ) 
\jeXp J 

<Op ( max ||X||2/3,- - cr,-yr+ljctjpz,- ) 
Vie A/p / 


+ Or, ( max 


( max aj-^1 + — crj-^1 + HaP^j ] 

VieAip J 


= Op{l). 


For the negative control estimator, the same argument holds by noticing the 
Op(l) term in (A.11) is also uniform over j (similar to rj). □ 


To prove the first conclusion in Theorem 3.4, we show the left hand side 
of (3.14) has expectation converging to a and variance converging to zero. 
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For the expectation, for any e > 0, 

^ > V 2 ) ^ ^ > V2 - e) + P(l^'il > e) 

' jeJVp ' P' jeJVp 

= 2(1 - ^{za/2 - e)) + 1 ^ Y1 

isAT, 

< 2(1 - ^{za /2 - e)) + P( max \vj\ > e) 

i<j<p 

^ 2(1 - ^{z ^/2 - e)). 

Similarly, one can prove lim„ ,p^oo |7i^ ZljeMp > ^ 0 / 2 ) ^ 2(1-$(2;o/2 + 

e)) for any e > 0. Thus the expectation converges to a when n,p —>• 00 . 

For the variance, we compute the second moment of the left hand side of 
(3.14): for any e > 0, 


TT^ X] > ^c./2,\tk\ > Z^/ 2 ) 

' JAeATp 

X] > V 2 ) + TT^ X] > ^“/2’ > ^«/2) 


K'2 




j,keAfp,jf!=k 


< 




2(1 - 4>(2;„/2 - e)) + P( inax \vj\ > e) 

i<j<p 


pi L 

+ Ia7U2 Pi\Zj\ > Z^/2- €,\Zk\ > Z„/2- 

P j,k£Mp,jj^k 

+ P(|uj| > e) + P(|ufc| > e) 

'■Tj^ Y ^(\z:j\> z:a/ 2 -^,\zk\> z^/ 2 -e)+o{l) 
P j,keJVp,j^k 

JAApI -1, 


-[2(1 - 4>(2 ;q,/2 - e))]^ + 0 ( 1 ) 
^ 4[1 - 4 >( 2;„/2 - e )]2 


lAApI 


Similarly we can prove the lower bound of the second moment. In conclusion, 
the second moment converges to a^, hence the variance of (3.14) converges 
to 0. 
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To prove the second conclusion in Theorem 3.4, we begin with 





=P ( max I Zj + Vi 


>4>-i(l-a/(2p))) 


<P( max Izd > 4> ^(1 — a/(2p)) — max \vj\] 
VieWp ieWp / 

<P( max \zj\ > — Q;/(2p)) — max |u,|) 

i<i<p jeWp / 


The conclusion ( 3 . 15 ) then follows from P^ maxi<j<p \zj\ > ^(1—a/ 

a (the validity of Bonferroni for i.i.d. normals), the fact that < 1>“^(1 — 


a/{2p)) —)• oo as p —)■ oo, and the result in Lemma A.3 that maxjg_/v'p \ vj 

Op{l). 


A.6 . Proof of Theorem 3.5. First, we point out that when a = 0, 
as n, p —)• oo 

(A.16) y/n-a^°'> = R-^Wi-^N{0,lr) 

where R and Wi are defined in Section 3.2. This is due to the fact 

that R^ Ir (Remark A.l) and \fnW\ ~ N(0, 7^), thus Slutsky’s Theorem 
implies (A.16). Next, we show that 

(A.17) ■ {a — = Op{l) 

For the negative control scenario, using the expression of 6;^*^ in (3.5) 
and Ti^c/||A1||2 in (3.4), we get 

11^ l|2 

Using the facts we got in (A.10) and = Op(l), we further get 

- aW) = {TcJ^c^Tcr^Tc'^c^Efc + Op{l)- 

Under Assumption 4, if \C\ —?• oo, the maximum eigenvalue of (rcS^f^Ff^)"^ 
goes to 0, thus (A. 17) holds for the negative control scenario. 
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For the sparsity scenario, in the proof of Theorem 3.3, we have shown 
that 


1 -1 


-v/n(Q:^^ — = —-v/« + Op(l) 

Thus, because of Lemma A.2, (A.17) also holds for the sparsity scenario. 
Finally, combining (A. 16) and (A. 17), Theorem 3.5 holds. 


A.7. Proof of Lemma 4.1. First, note that by the strong law of large 
numbers ^ (Xq Xi) (Xq Xi) Hx- Using the QR decomposition 

of (Xq Xi) = QU and writing U = ^ ~ 

clear that Sx- Since Hx is nonsingular, both t/oo and Un are 

full rank square matrices with probability 1. Thus using the block matrix 

inversion formula, we have V~^ = ( t?— i ) where * represents some 

do X do or do x di matrix. Therefore the right bottom block of nV~^V~'^ is 
nil II and converges to fin almost surely. 

A.8 . Proof of Theorem 4.1. First, for the known zero indices sce- 
^ NC 

nario, A^ has the following formula, which is similar to (3.5): 

(A.18) 

which implies a similar formula as (A.ll): 

(A.19) 

- Bi,5) • Ts{TlYc^Tc)-^TlY^^ElcU^,^ 

+Vn-(rS’^-rs)A® 

• rs(r?5ic'rc)“'r^Sc'(fc - rf )a® + op(i), 

where A^°^ = Following the proof of Theorem 3.1 by 

using Lemma 4.1, we get (4.6). 

For the unknown zero indices scenario. Lemma 4.1 guarantees the con- 
sistency of each column of A^ by using Theorem 3.2. Then the Taylor 
expansion used in the proof of Theorem 3.3 still works at each column of 
a[°\ Similar to (A. 15), we get 


(A.20) 




Bi) =^E^U^^ + \/^(r(0) - f)Af^ 

+ f(fl'i 92 ••• Qdi) 
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where Qi = V^'p(A®) + Op(l)). Following the proof of 

Theorem 3.3, we get each Qi = Op(l). Thus 

- Bi) = • (r(0) - f )Af^ + Op(l) 

and (4.7) holds. 

APPENDIX B: SUPPLEMENTARY EIGURES AND TABLES 


r 

mean 

median 

sd 

mad 

skewness 

medc. 

#sig. 

X/Y top 100 

p-value 

0 

0.077 

0.14 

1.25 

1.04 

-0.949 

-0.064 

605 

97 

58 

NA 

1 

0.19 

0.21 

1.37 

1.2 

-0.556 

-0.02 

458 

90 

72 

0.013 

2 

0.15 

0.19 

1.41 

1.23 

-0.464 

-0.027 

457 

91 

74 

0.039 

3 

0.015 

0.055 

1.38 

1.18 

-0.442 

-0.035 

509 

97 

75 

0.00096 

4 

0.045 

0.065 

1.27 

1.03 

-0.661 

-0.018 

608 

101 

78 

5.2e-07 

5 

0.044 

0.067 

1.3 

1.06 

-0.573 

-0.019 

612 

100 

76 

1.8e-06 

7 

0.071 

0.088 

1.34 

1.11 

-0.527 

-0.0097 

572 

99 

76 

2.7e-06 

10 

0.024 

0.057 

1.39 

1.15 

-0.539 

-0.025 

658 

100 

75 

5.3e-07 

15 

0.097 

0.12 

1.48 

1.23 

-0.619 

-0.018 

659 

102 

76 

2.4e-08 

20 

0.051 

0.072 

1.48 

1.24 

-0.628 

-0.015 

625 

102 

76 

6.5e-ll 

30 

0.021 

0.038 

1.58 

1.28 

-0.709 

-0.01 

748 

109 

81 

5.6e-13 

33 

0.032 

0.052 

1.63 

1.33 

-0.669 

-0.013 

751 

109 

79 

2.7e-12 

40 

0.027 

0.052 

1.75 

1.44 

-0.544 

-0.017 

846 

111 

78 

6.3e-ll 

50 

0.034 

0.054 

1.93 

1.59 

-0.389 

-0.0088 

954 

117 

76 

1.3e-09 


Table 3 

Supplementary Dataset (n = 143, p = 54675). This is same as dataset 1 except the 
primary variable is gender instead of COPD severity. 
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6Type I error |± Power liFDP EToplOO 


Fig 4: Compare the performance of nine different approaches when the vari¬ 
ance of X explained by the confounding factors is 5%. The error bars are one 
standard deviation over 100 repeated simulations. The three dashed horizon¬ 
tal lines from bottom to top are the nominal significance level, FDR level, 
oracle power and the precision of the smallest 100 p-values, respectively. 
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6Type I error |± Power liFDP EToplOO 


Fig 5: Compare the performance of nine different approaches when latent 
factors are unconfounding. The error bars are one standard deviation over 
100 repeated simulations. The three dashed horizontal lines from bottom 
to top are the nominal significance level, FDR level, oracle power and the 
precision of the smallest 100 p-values, respectively. 






























CONFOUNDER ADJUSTMENT 


49 


[8] Bai, J. and Ng, S. (2006). Confidence Intervals for Diffusion Index Forecasts and 
Inference for Factor-Augmented Regressions. Econometrica 74 1133-1150. 

[9] Benjamin:, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a 
practical and powerful approach to multiple testing. Journal of the Royal Statistical 
Society. Series B (Methodological) 51 289-300. 

[10] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in 
multiple testing under dependency. Annals of statistics 29 1165-1188. 

[11] Blalock, E. M., Geddes, J. W., Chen, K. C., Porter, N. M., Markes- 
BERY, W. R. and Landfield, P. W. (2004). Incipient Alzheimer’s disease: microar¬ 
ray correlation analyses reveal major transcriptional and tumor suppressor responses. 
Proceedings of the National Aeademy of Sciences of the United States of Ameriea 101 
2173-2178. 

[12] Bollen, K. a. (1989). Struetural equations with latent variables. John Wiley & Sons. 

[13] Brys, G., Hubert, M. and Struyf, A. (2004). A robust measure of skewness. 
Journal of Computational and Graphieal Statistics 13 996-1017. 

[14] Chandrasekaran, V., Parrilo, P. A. and Willsky, A. S. (2012). Latent variable 
graphical model selection via convex optimization. Ann. Statist. 40 1935-1967. 

[15] Clarke, S. and Hall, P. (2009). Robustness of multiple testing procedures against 
dependence. The Annals of Statisties 37 332-358. 

[16] Craig, A., Cloarec, O., Holmes, E., Nicholson, J. K. and Lindon, J. C. (2006). 
Scaling and normalization effects in NMR spectroscopic metabonomic data sets. An¬ 
alytical Chemistry 78 2262-2267. 

[17] De La Euente, A., Bing, N., Hoeschele, I. and Mendes, P. (2004). Discov¬ 
ery of meaningful associations in genomic data using partial correlation coefficients. 
Bioinformatics 20 3565-3574. 

[18] Desai, K. H. and Storey, J. D. (2012). Cross-dimensional inference of dependent 
high-dimensional data. Journal of the American Statistical Association 107 135-151. 

[19] Efron, B. (2007). Correlation and large-scale simultaneous significance testing. Jour¬ 
nal of the Ameriean Statistieal Association 102 93-103. 

[20] Efron, B. (2010). Correlated z-values and the accuracy of large-scale statistical 
estimates. Journal of the American Statistical Association 105 1042-1055. 

[21] Fan, j., Han, X. and Gu, W. (2012). Estimating false discovery proportion under 
arbitrary covariance dependence. Journal of the American Statistical Association 107 
1019-1035. 

[22] Fan, j. and Han, X. (2013). Estimation of false discovery proportion with unknown 
dependence. arXiv:1305.7007. 

[23] Fare, T. L., Coffey, E. M., Dai, H., He, Y. D., Kessler, D. A., Kilian, K. A., 
Kogh, j. E., LeProust, E., Marton, M. J., Meyer, M. R. et al. (2003). Effects of 
atmospheric ozone on microarray data quality. Analytical chemistry 75 4672-4675. 

[24] Fisher, R. A. (1935). The design of experiments. Oliver & Boyd. 

[25] Friguet, C., Kloareg, M. and Causeur, D. (2009). A factor model approach to 
multiple testing under dependence. Journal of the American Statistical Association 
104 1406-1415. 

[26] Gagnon-Bartsch, J., Jacob, L. and Speed, T. (2013). Removing unwanted vari¬ 
ation from high dimensional data with negative controls Technical Report, Technical 
Report 820, Department of Statistics, University of California, Berkeley. 

[27] Gagnon-Bartsch, J. A. and Speed, T. P. (2012). Using control genes to correct 
for unwanted variation in microarray data. Biostatistics 13 539-552. 

[28] Gasch, a. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., 
Storz, G., Botstein, D. and Brown, P. O. (2000). Genomic expression programs 



50 


J. WANG ET AL. 


in the response of yeast cells to environmental changes. Molecular biology of the cell 
11 4241-4257. 

[29] Greenland, S., Robins, J. M. and Pearl, J. (1999). Confounding and collapsibility 
in causal inference. Statistical Science 14 29-46. 

[30] Grzebyk, M., Wild, P. and Chouaniere, D. (2004). On identification of multi¬ 
factor models with correlated residuals. Biometrika 91 141-151. 

[31] Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonel- 
Lis, K. J., SCHERF, U., Speed, T. P. et al. (2003). Exploration, normalization, 
and summaries of high density oligonucleotide array probe level data. Biostatistics 4 
249-264. 

[32] Jin, J. (2012). Comment. Journal of the American Statistical Association 107 1042- 
1045. 

[33] Kish, L. (1959). Some statistical problems in research design. American Sociological 
Review 24 328-338. 

[34] Korn, E. L., Troendle, J. F., McShane, L. M. and Simon, R. (2004). Controlling 
the number of false discoveries: application to high-dimensional genomic data. Journal 
of Statistical Planning and Inference 124 379-398. 

[35] Kuroki, M. and Pearl, J. (2014). Measurement Bias and Effect Restoration in 
Causal Inference. Biometrika 101 423-437. 

[36] Lan, W. and Du, L. (2014). A Factor-Adjusted Multiple Testing Procedure with 
Application to Mutual Fund Selection. arXiv:1407.5515. 

[37] Lazar, C., Meganck, S., Taminau, J., Steenhoff, D., Coletta, A., Molter, C., 
Weiss-Soli's, D. Y., Duque, R., Bersini, H. and Nowe, A. (2013). Batch effect 
removal methods for microarray gene expression data integration: a survey. Briefings 
in bioinformatics 14 469-490. 

[38] Leek, J. T. and Storey, J. D. (2007). Capturing heterogeneity in gene expression 
studies by surrogate variable analysis. PLoS genetics 3 1724-1735. 

[39] Leek, J. T. and Storey, J. D. (2008). A general framework for multiple testing 
dependence. Proceedings of the National Academy of Sciences 105 18718-18723. 

[40] Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., John¬ 
son, W. E., Geman, D., Baggerly, K. and Irizarry, R. A. (2010). Tackling 
the widespread and critical impact of batch effects in high-throughput data. Nature 
Reviews Genetics 11 733-739. 

[41] Li, j. and Zhong, P.-S. (2014). A rate optimal procedure for sparse signal recovery 
under dependence. arXiv preprint arXiv: 1410.2839. 

[42] Lin, D. W., Coleman, I. M., Hawley, S., Huang, C. Y., Dumpit, R., Gif¬ 
ford, D., Kezele, P., Hung, H., Knudsen, B. S., Kristal, A. R. et al. (2006). 
Influence of surgical manipulation on prostate gene expression: implications for molec¬ 
ular correlates of treatment effects and disease prognosis. Journal of clinical oncology 
24 3763-3770. 

[43] Maronna, R. a., Martin, D. R. and Yohai, V. J. (2006). Robust statistics: Theory 
and Methods. John Wiley & Sons, Chichester. 

[44] Onatski, a. (2010). Determining the number of factors from empirical distribution 
of eigenvalues. The Review of Economics and Statistics 92 1004-1016. 

[45] Owen, A. B. (2005). Variance of the number of false discoveries. Journal of the Royal 
Statistical Society: Series B (Statistical Methodology) 67 411-426. 

[46] Owen, A. B. and Wang, J. (2016). Bi-cross-validation for factor analysis. Statistical 
Science (to appear). 

[47] Pearl, J. (2009). Causality: models, reasoning and inference. Cambridge Univ Press. 

[48] Perry, P. O. and Pillai, N. S. (2013). Degrees of freedom for combining regression 



CONFOUNDER ADJUSTMENT 


51 


with factor analysis. arXiv preprint arXiv:1310.7269. 

[49] Pesaran, M. H. (2004). General Diagnostic Tests for Cross Section Dependence in 
Panels Cambridge Working Papers in Economics No. 0435, Faculty of Economics, 
University of Cambridge. 

[50] Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., 
Shadick, N. a. and Reich, D. (2006). Principal components analysis corrects for 
stratification in genome-wide association studies. Nature genetics 38 904-909. 

[51] Ransohoff, D. F. (2005). Bias as a threat to the validity of cancer molecular-marker 
research. Nature Reviews Cancer 5 142-149. 

[52] Rhodes, D. R. and Chinnaiyan, A. M. (2005). Integrative analysis of the cancer 
transcriptome. Nature genetics 37 S31-S37. 

[53] SCHWARTZMAN, A. (2010). Comment. Journal of the American Statistical Association 
105 1059-1063. 

[54] SCHWARTZMAN, A., DOUGHERTY, R. F. and Taylor, J. E. (2008). False discovery 
rate analysis of brain diffusion direction maps. The Annals of Applied Statistics 2 
153-175. 

[55] She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized re¬ 
gression. Journal of the American Statistical Association 106 626-639. 

[56] Singh, D., Fox, S. M., Tal-Singer, R., Plumb, J., Bates, S., Broad, P., Ri¬ 
ley, J. H. and Celli, B. (2011). Induced sputum genes associated with spirometric 
and radiological disease severity in COPD ex-smokers. Thorax 66 489-495. 

[57] Storey, J. D., Taylor, J. E. and Siegmund, D. (2004). Strong control, conser¬ 
vative point estimation and simultaneous conservative consistency of false discovery 
rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical 
Methodology) 66 187-205. 

[58] Sun, Y. (2011). On latent systemic effects in multiple hypotheses PhD thesis, Stan¬ 
ford University. 

[59] Sun, W. and Cai, T. (2009). Large-scale multiple testing under dependence. Journal 
of the Royal Statistical Society: Series B (Statistical Methodology) 71 393-424. 

[60] Sun, Y., Zhang, N. R. and Owen, A. B. (2012). Multiple hypothesis testing ad¬ 
justed for latent variables, with an application to the agemap gene expression data. 
The Annals of Applied Statistics 6 1664-1688. 

[61] Tusher, V. G., Tibshirani, R. and Chu, G. (2001). Significance analysis of microar¬ 
rays applied to the ionizing radiation response. Proceedings of the National Academy 
of Sciences 98 5116-5121. 

[62] Vawter, M. P., Evans, S., Choudary, P., Tomita, H., Meador-Woodruff, J., 
Molnar, M., Li, J., Lopez, J. F., Myers, R., Cox, D. et al. (2004). Gender-specific 
gene expression in post-mortem human brain: localization to sex chromosomes. Neu¬ 
ropsychopharmacology 29 373-384. 

[63] Wang, S., Cui, G. and Li, K. (2015). Factor-augmented regression models with 
structural change. Economics Letters 130 124-127. 

[64] Wang, J., Zhao, Q., Hastie, T. and Owen, A. B. (2015). Supplement to ’’Con- 
founder Adjustment in Multiple Hypothesis Testing”. 

[65] Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for 
regression. The Annals of Statistics 642-656. 



52 


J. WANG ET AL. 


Department of Statistics 
Stanford University 
390 Serra Mall 
Stanford, California 94305 
USA 

E-mail: jingshuw@stanford.edu 
E-mail: qyzhao@stanford.edu 
E-mail: hastie@stanford.edu 
E-mail: owen@stanford.edu 


